- 29 August 94
- Version 1.21 of pccts
-
- This help file is available via anonymous FTP at:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- File: /pub/pccts/1.22/NOTES.newbie
-
- Mirror sites for pccts:
-
- Europe:
-
- Node: ftp.th-darmstadt.de [130.83.55.75]
- Directory: pub/programming/languages/compiler-compiler/pccts
-
- According to the FAQ this is updated daily.
-
- Node: src.doc.ic.ac.uk
- Directory: /computing/programming/languages/tools/pccts
-
- Unable to reach this node from my site.
- Also:
-
- Node: ftp.uu.net
- Directory: languages/tools/pccts
-
- Mail corrections or additions to moog@polhode.com
- ===============================================================================
- Miscellaneous
- -------------------------------------------------------------------------------
- 1. NEVER choose rule names, #token names, #lexclass names, #errclass
- names, etc. which coincide with the reserved words of your C or C++
- compiler. Be awake to name collisions with your favorite libraries
- and #include files. One can only imagine the results of definitions like:
-
- #token FILE "file"
-
- const: "[0-9]*"
-
- 2. Tokens begin with uppercase characters. Rules begin with lowercase
- characters.
-
- 3. When passing the name of the start rule to the ANTLR macro don't
- forget to code the trailing function arguments:
-
- /* Not using ASTs */ ANTLR (grammar(),stdin);
- /* Using ASTs */ ANTLR (grammar(&ASTroot),stdin);
-
- /* *** Wrong *** */ ANTLR (grammar,stdin);
-
- 4. When you see a syntax error message that has quotation marks on
- separate lines:
-
- line 1: syntax error at "
- " missing ID
-
- that probably means that the offending element contains a newline.
-
-
- 5. Even if your C compiler does not support C++ style comments,
- you can use them in the *non-action* portion of the ANTLR source code.
- Inside an action (i.e. <<...>> ) you have to obey the comment
- conventions of your compiler.
-
- 6. To place main() in a ".c" file rather than a grammar file (".g")
- place:
-
- #include "stdpccts.h"
-
- before invoking the ANTLR macro. Contributed by N.F. Ross.
-
- 7. ANTLR counts a line which is continued across a newline using
- the backslash convention as a single line. For example:
-
- #header <<
- #define abcd alpha\
- beta\
- gamma\
- delta
- >>
-
- This will cause line numbers in ANTLR error messages to be off by 3 compared
- to most text editors.
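- The off-by-3 effect can be reproduced with a few lines of C. This is a
- counting sketch, not ANTLR's actual code; it treats a backslash-newline
- pair as a line continuation, the way ANTLR does:
-
```c
#include <assert.h>

/* Count physical lines (what a text editor shows) versus logical lines
 * (what ANTLR counts: a backslash-newline pair continues the current
 * line).  A sketch of the rule, not ANTLR's actual counter. */
static void count_lines(const char *s, int *physical, int *logical)
{
    *physical = *logical = 0;
    for (const char *p = s; *p; p++) {
        if (*p == '\n') {
            (*physical)++;
            if (p == s || p[-1] != '\\')
                (*logical)++;          /* only unescaped newlines count */
        }
    }
}
```
-
- For the #define above the two counts differ by exactly 3, matching the
- drift in the error messages.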
-
- 8. The Purdue Computer Science Department maintains a WWW directory
- which includes a pccts page:
-
- URL http://tempest.ecn.purdue.edu:8001/
-
-
- 9. Suppose one wants to parse files that "include" other files. The
- code in ANTLR (antlr.g) for handling #tokdefs statements demonstrates
- how this may be done.
-
- grammar: ...
-
- | "#tokdefs" QuotedTerm
-
- <<{
-
- zzantlr_state st; /* defined in antlr.h */
- struct zzdlg_state dst; /* defined in dlgdef.h */
- FILE *f;
-
- UserTokenDefsFile = mystrdup(LATEXT(1));
- zzsave_antlr_state(&st);
- zzsave_dlg_state(&dst);
- f = fopen(StripQuotes(LATEXT(1)),"r");
- if ( f==NULL ) {
- warn(eMsg1("cannot open token defs file '%s'",
- LATEXT(1)+1));}
- else {
- ANTLRm( enum_file(), f, PARSE_ENUM_FILE);
- UserDefdTokens = 1;
- }
- zzrestore_antlr_state(&st);
- zzrestore_dlg_state(&dst);
- }>>
-
- The code uses zzsave_antlr_state() and zzsave_dlg_state() to save the state
- of the current parse. The ANTLRm macro specifies a starting rule for ANTLR
- of "enum_file" and starts DLG in the PARSE_ENUM_FILE state rather than the
- default state (which is the current state - whatever it might be). Because
- enum_file() is called without any arguments it appears that enum_file() does
- not use ASTs nor pass back any attributes. Contributed by Terence J. Parr.
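- The same save/restore idiom can be sketched in plain C without the pccts
- runtime. Here two globals stand in for the parser/scanner state, the
- "files" are hypothetical in-memory strings, and a line starting with "@"
- plays the role of the #tokdefs (or include) directive:
-
```c
#include <assert.h>
#include <string.h>

/* The save/restore idiom from antlr.g in miniature. */

struct state { const char *pos; int line; };

static const char *g_pos;   /* stands in for DLG's input position */
static int         g_line;  /* stands in for zzline               */

static void save_state(struct state *st)          /* zzsave_antlr_state()    */
{ st->pos = g_pos; st->line = g_line; }

static void restore_state(const struct state *st) /* zzrestore_antlr_state() */
{ g_pos = st->pos; g_line = st->line; }

static const char *lookup(const char *name)       /* toy "filesystem" */
{
    return strcmp(name, "inner") == 0 ? "b\nc\n" : NULL;
}

static void process(const char *text, char *out)  /* like ANTLRm(enum_file(),...) */
{
    g_pos = text;
    g_line = 1;
    while (g_pos && *g_pos) {
        const char *start = g_pos;
        const char *nl = strchr(g_pos, '\n');
        size_t len = nl ? (size_t)(nl - g_pos) : strlen(g_pos);
        g_pos += len + (nl != NULL);   /* consume the directive line first */
        g_line++;
        if (*start == '@') {           /* an "include" directive */
            char name[32];
            memcpy(name, start + 1, len - 1);
            name[len - 1] = '\0';
            struct state st;
            save_state(&st);           /* freeze the outer file...          */
            const char *inc = lookup(name);
            if (inc) process(inc, out);/* ...parse the inner one...         */
            restore_state(&st);        /* ...and resume where we left off   */
        } else {
            strncat(out, start, len);
            strcat(out, "\n");
        }
    }
}
```
-
- The point is the shape of the code, not the details: because the parser
- state lives in globals, it must be explicitly saved before the nested
- parse and restored afterwards.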
-
- 10. If an action becomes too large it will overflow an ANTLR buffer
- ("... error: action buffer overflow: size 4000"). If you don't wish to
- recompile ANTLR with a larger value for ZZLEXBUFSIZE you can put the
- action in an include file and then place a #include in the action.
- Suggested by David Seidel (dseidel@delphi.com).
-
- Splitting an action into two smaller actions will not work if the second
- action needs to refer to zzlextext.
-
- 11. When one is using multiple input files (for example "a.g" and "b.g"
- to generate "a.c" and "b.c") the only way to place file scope information
- in b.c is to place it in #header of the first grammar file. ANTLR won't
- allow file scope information to be copied from b.g into b.c using
- "<<...>>" notation. If one did place a file scope action in the b.g, ANTLR
- would try to interpret it as the fail action of the last rule appearing in
- a.g. (the first grammar file). The workaround is to #include b.c in
- another file which has your file scope declarations. You'll probably
- need to #include "stdpccts.h" before your file scope definitions.
-
- 12. Multiple parsers can coexist in the same application through use of
- the #parser directive (in C output mode). The #parser statement is not used
- with the ANTLR C++ output option because one can simply instantiate a new
- parser object.
-
-
- 13. Don't confuse #[...] with #(...).
-
- The first creates a single AST node (usually from a token identifier and
- an attribute) using the routine zzmk_ast(). The zzmk_ast() routine must be
- supplied by the user (or selected from one of the pccts supplied ones such
- as pccts/h/charbuf.h, pccts/h/charptr.*, and pccts/h/int.h).
-
- The second creates an AST list (usually more than a single node) from other
- ASTs by filling in the "down" field of the first node in the list to create
- a root node, and the "sibling" fields of each of the remaining ASTs in the
- list. A null pointer is put in the sibling field of the last AST in the
- list. This is performed by the pccts supplied routine zztmake().
-
- #token ID "[a-z]*"
- #token COLON ":"
- #token STMT_WITH_LABEL
-
- id! : ID <<#0=#[STMT_WITH_LABEL,$1];>> /* a1 */
-
- Creates an AST. The AST (a single node)
- contains STMT_WITH_LABEL in the token
- field - given a traditional version of
- zzmk_ast().
-
- rule! : id COLON expr /* a2 */
- <<#0=#(#1,#3);>>
-
- Creates an AST list with the ID at its
- root and "expr" as its first (and only) child.
-
- The following example (a3) is equivalent to a2, but more confusing because
- the two steps above have been combined into a single action statement:
-
- rule! : ID COLON expr
- <<#0=#(#[STMT_WITH_LABEL,$1],#3);>> /* a3 */
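- The two operators can be mimicked in plain C with a hand-rolled node type.
- Below, mk_node() plays the role of #[...] (zzmk_ast, minus the attribute)
- and tmake() the role of #(...) (zztmake); the field names follow the
- "down"/"sibling" description above. A sketch only - the real zztmake()
- also accepts a NULL root, which is omitted here:
-
```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdlib.h>

enum { ID = 1, COLON, STMT_WITH_LABEL };

typedef struct ast {
    int token;
    struct ast *down;      /* first child  */
    struct ast *sibling;   /* next sibling */
} AST;

static AST *mk_node(int token)                 /* #[token] : one node */
{
    AST *n = calloc(1, sizeof *n);
    n->token = token;
    return n;
}

static AST *tmake(AST *root, ...)              /* #(root, child, ..., NULL) */
{
    va_list ap;
    AST *child, *prev = NULL;
    va_start(ap, root);
    while ((child = va_arg(ap, AST *)) != NULL) {
        if (prev == NULL)
            root->down = child;                /* first child fills "down"   */
        else
            prev->sibling = child;             /* the rest chain as siblings */
        prev = child;
    }
    va_end(ap);                                /* last sibling stays NULL    */
    return root;
}
```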
-
- ===============================================================================
- Section on switches and options
- -------------------------------------------------------------------------------
- 14. Don't forget about the ANTLR -gd option which provides a trace of
- rules which are triggered and exited.
-
- The trace option can be useful in sometimes unexpected ways. For example,
- by suitably defining the macros zzTRACEIN and zzTRACEOUT before the
- #include of "antlr.h" one can accumulate information on how often each
- rule is invoked.
-
- 15. When you want to inspect the code generated by ANTLR you may want to
- use the ANTLR -gs switch. This causes ANTLR to test whether a token is
- an element of a lookahead set by using explicit tests with meaningful
- token names rather than by using the faster bit-oriented operations, which
- are difficult to read.
-
- 16. When using the ANTLR -gk option you probably want to use the DLG -i
- option. As far as I can tell neither option works by itself.
- Unfortunately they have different abbreviations so that one can't
- use the same symbol for both in a makefile.
-
- 17. When you are debugging code in the rule section and there is no
- change to the lexical scanner, you can avoid regeneration of scanner.c
- by using the ANTLR -gx option. However some items from stdpccts.h
- can affect the scanner, such as -k, -ck, and the addition of semantic
- predicates - so this optimization should be used with a little care.
-
- 18. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
-
- 19. If you want backtracking, but not the prefetching of characters and
- tokens that one gets with lookahead, then you might want to try using
- your own input routine and then using ANTLRs (input supplied by string)
- or ANTLRf (input supplied by function) rather than plain ANTLR which
- is used in most of the examples.
-
- See Example 4 below for an example of an ANTLRf input function.
-
- 20. The format used in #line directive is controlled by the macro
-
- #define LineInfoFormatStr "# %d \"%s\"\n"
-
- which is defined in generic.h. A change requires recompilation of ANTLR.
-
- Prior to version 1.21 ANTLR -gl will sometimes place #line directives in
- a column other than column 1. The temporary workaround is to change
- the format string to:
-
- #define LineInfoFormatStr "\n# %d \"%s\"\n"
-
- This is reported as fixed in versions >= 1.21
-
- 21. To make the lexical scanner case insensitive use the DLG -ci
- switch. The analyzer does not change the text, it just ignores case
- when matching it against the regular expressions.
-
- The problem in version 1.10 with the -ci switch is fixed in versions >= 1.20.
-
-
- 22. In order to use a different name for the mode.h file it is necessary
- to supply the new name using both the ANTLR -fm switch and the DLG -m switch.
- ANTLR does not generate mode.h, but it does generate #include statements
- which reference it.
-
- ===============================================================================
- C++ Mode
- -------------------------------------------------------------------------------
- 23. When using backtracking (syntactic predicates) ANTLRtoken should be
- derived from ANTLRCommonBacktrackingToken rather than ANTLRCommonToken.
-
- ===============================================================================
- Section on #token, #tokclass, #tokdef #errclass (but not #lexclass)
- -------------------------------------------------------------------------------
- 24. To gobble up everything to a newline use: "~[\n]*".
-
- 25. To match any single character use: "~[]".
-
- 26. The char * array "zztokens" in err.c contains the text for the name of
- each token (indexed by the token number). This can be extremely useful
- for debugging and error messages.
-
- 27. If a #token symbol is spelled incorrectly in a rule it will not be
- reported by ANTLR unless the ANTLR -w2 option is set. ANTLR will assign
- it a new #token number which, of course, will never be matched. Look at
- token.h for misspelled terminals or inspect "zztokens[]" in err.c.
-
- 28. If you happen to define the same #token name twice (perhaps
- because of inadvertent duplication of a name) you will receive no
- error messages from ANTLR or DLG. ANTLR will simply use the later
- definition and forget the earlier one. Using the ANTLR -w2 option
- does not change this behavior.
-
- 29. One cannot continue a regular expression in a #token statement across
- lines. If one tries to use "\" to continue the line the lexical analyzer
- will think you are trying to match a newline character.
-
- 30. The escaped literals in #token regular expressions are not identical
- to the ANSI escape sequences. For instance "\v" will yield a match
- for "v", not a vertical tab.
-
- \t \n \r \b - the only escaped letters
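- A sketch of that table as a C helper (a hypothetical function, not part
- of DLG):
-
```c
#include <assert.h>

/* Only the four letters t, n, r, b map to control characters when
 * escaped in a #token regular expression; any other escaped letter
 * matches the bare letter itself. */
static char dlg_escape(char c)
{
    switch (c) {
    case 't': return '\t';
    case 'n': return '\n';
    case 'r': return '\r';
    case 'b': return '\b';
    default:  return c;        /* "\v" matches 'v', not a vertical tab */
    }
}
```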
-
- 31. In #token regular expressions spaces and tabs which are
- not escaped are ignored - thus making it easy to add white space to
- a regular expression.
-
- #token symbol "[a-z A-Z] [a-z A-Z 0-9]*"
-
- 32. You can achieve a limited form of one character lookahead in the
- #token statement action by using zzchar which contains the character following
- the regular expression just recognized.
-
- 33. The regular expressions appearing in #errclass declarations must
- be unique.
-
-
- 34. You cannot supply an action (even a null action) for a #token
- statement without a regular expression. You'll receive the message:
-
- warning: action cannot be attached to a token name
- (...token name...); ignored
-
- This is a minor problem when the #token is created for use with
- attributes or AST nodes and has no regular expression:
-
- #token CAST_EXPR
- #token SUBSCRIPT_EXPR
- #token ARGUMENT_LIST
-
- <<
- ... Code related to parsing
- >>
-
- ANTLR assumes the code block is the action associated with the #token
- immediately preceding it. It is not obvious what the problem is because
- the line number referenced is the END of the code block (">>") rather
- than the beginning. My solution is to follow such #token statements
- with a #token which does have a regular expression (or a rule).
-
- 35. Since the lexical analyzer wants to find the longest possible string
- that matches a regular expression, it is probably best not to use expressions
- like "~[]*" which will gobble up everything to the end-of-file.
-
- 36. The lexical routines zzmode(), zzskip(), and zzmore() do NOT work like
- coroutines. Basically, all they do is set status bits or fields in a
- structure owned by the lexical analyzer and then return immediately. Thus it
- is OK to call these routines anywhere from within a lexical action. You
- can even call them from within a subroutine called from a lexical action
- routine.
-
- See Example 5 below for routines which maintain a stack of modes.
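- The non-coroutine nature of zzmode() is exactly what makes a mode stack
- easy: pushMode()/popMode() just record numbers. In this self-contained
- sketch zzmode() is stubbed out with an assignment; in real pccts code the
- stub would call zzmode() instead. The #lexclass numbers are hypothetical:
-
```c
#include <assert.h>

/* A minimal mode stack in the spirit of Example 5. */

enum { START, COMMENT, STRINGS };     /* hypothetical #lexclass numbers */

static int cur_mode = START;          /* stands in for DLG's current mode */
static int mode_stack[32];
static int mode_sp = 0;

static void zzmode_stub(int m) { cur_mode = m; }  /* real code: zzmode(m) */

static void pushMode(int m)
{
    mode_stack[mode_sp++] = cur_mode; /* remember where to come back to */
    zzmode_stub(m);
}

static void popMode(void)
{
    zzmode_stub(mode_stack[--mode_sp]);
}
```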
-
-
- 37. When a string is matched by two #token regular expressions of equal
- length, the lexical analyzer will choose the one which appears first in
- the source code. Thus more specific regular expressions should appear
- before more general ones:
-
- #token HELP "help" /* should appear before "symbol" */
- #token symbol "[a-zA-Z]*" /* should appear after keywords */
-
- Some of these may be caught by using the DLG switch -Wambiguity.
- Consider the following grammar:
-
- #header <<
- #include "charbuf.h"
- >>
- <<
- int main() {
- ANTLR (statement(),stdin);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-z A-Z]*"
- #token HELP "HELP"
-
- statement
- : HELP "@" <<printf("token HELP\n");>> /* a1 */
- | "inline" "@" <<printf("token inline\n");>> /* a2 */
- | ID "@" <<printf("token ID\n");>> /* a3 */
- ;
-
- Is an in-line regular expression treated any differently than a regular
- expression appearing in a #token statement? No! ANTLR/DLG does *NOT*
- check for a match to "inline" (line a2) before attempting a match to the
- regular expressions defined by #token statements. The first two
- alternatives ("a1" and "a2") will NEVER be matched. All of this will be
- clear from examination of the file "parser.dlg".
-
- If one builds this example using the DLG switch -Wambiguity one gets
- the message:
-
- dlg warning: ambigious regular expression 3 4
- dlg warning: ambigious regular expression 3 5
-
- The numbers which appear in the DLG message refer to the assigned token
- numbers. Examine the array zztokens[] in err.c to find the regular
- expression which corresponds to the token number reported by DLG.
-
- ANTLRChar *zztokens[6]={
- /* 00 */ "Invalid",
- /* 01 */ "@",
- /* 02 */ "WhiteSpace",
- /* 03 */ "ID",
- /* 04 */ "HELP",
- /* 05 */ "inline"
- };
-
- One can also look at the file "scan.c" in which action 4 would
- appear in the function "static void act4() {...}".
-
- The best advice is to follow the example of the Master, TJP, and place
- things like #token ID at the end of the grammar file.
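- The tie-breaking rule can be illustrated with a plain C stand-in for the
- generated scanner (a sketch, not DLG output). Both "help" and "[a-z]*"
- match all of "help"; the classifier checks the patterns in #token
- declaration order, so HELP wins, while a longer word is matched only by
- the more general pattern and comes back as ID:
-
```c
#include <assert.h>
#include <string.h>
#include <ctype.h>

enum { NONE, HELP, ID };

/* Patterns are tried in #token declaration order: the keyword first,
 * then the general identifier pattern. */
static int classify(const char *word)
{
    if (strcmp(word, "help") == 0)           /* #token HELP "help"  */
        return HELP;
    for (const char *p = word; *p; p++)      /* #token ID  "[a-z]*" */
        if (!islower((unsigned char)*p))
            return NONE;
    return *word ? ID : NONE;
}
```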
-
-
- 38. The DLG lexical analyzer is not able to backtrack. Consider the
- following example:
-
- #token "[\ \t]*" <<zzskip();>>
- #token ELSE "else"
- #token ELSEIF "else [\ \t]* if"
- #token STOP "stop"
-
- with input:
-
- else stop
-
- When DLG gets to the end of "else" it realizes that the spaces will allow
- it to match a longer string than "else" by itself. So DLG starts to accept
- the spaces. When DLG gets to the initial "s" in "stop" it realizes it has
- gone too far - but it can't backtrack. It passes back an error status to
- ANTLR which (normally) prints out something like:
-
- invalid token near line 1 (text was 'else ') ...
-
- There is an "extra" space between the "else" and the closing single quote
- mark.
-
- The fix for this particular example was to make ELSE accept the trailing
- spaces.
-
- #token "[\ \t]*" <<zzskip();>>
- #token ELSE "else [\ \t]*"
- #token ELSEIF "else [\ \t]* if"
-
- In other cases you may want to use #lexclass. You'll need to eliminate
- trailing space in the action part of the #token statement or have an
- attribute processing routine that can tolerate trailing spaces. In many
- cases the actual text of the reserved words is unimportant since only the
- token number is used.
-
- This problem is not detected by the DLG option -Wambiguity.
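- The failure can be simulated with a hand-rolled, non-backtracking
- automaton for just these two tokens (a sketch, not DLG output). Consumed
- characters cannot be pushed back, so when the scan gets stuck it succeeds
- only if it is *currently* in an accepting state. The else_eats_spaces
- flag corresponds to rewriting ELSE as "else [\ \t]*":
-
```c
#include <assert.h>
#include <string.h>

enum { ERR = -1, T_ELSE = 1, T_ELSEIF = 2 };

static int scan(const char *s, int else_eats_spaces, int *consumed)
{
    if (strncmp(s, "else", 4) != 0) { *consumed = 0; return ERR; }
    int n = 4;
    int acc = T_ELSE;                      /* "else" by itself is a token */
    while (s[n] == ' ' || s[n] == '\t') {  /* greedily reach for ELSEIF   */
        n++;
        acc = else_eats_spaces ? T_ELSE : 0;   /* stuck here otherwise    */
    }
    if (s[n] == 'i' && s[n + 1] == 'f') {
        n += 2;
        acc = T_ELSEIF;
    }
    *consumed = n;                /* eaten characters cannot be un-read   */
    return acc ? acc : ERR;
}
```
-
- With the original definitions the scan of "else stop" fails after eating
- five characters ("else " - note the trailing blank, just as in the error
- message above); with the fix it returns T_ELSE.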
-
- 39. In converting a long list of tokens appearing in a rule to use #tokclass
- I simply replaced the rule, in situ, with the #tokclass directive and did a
- global replace of the rule name with a new name in which the first letter
- was capitalized. It took me a while to realize that the ANTLR message:
-
- xxx.g, line 123: warning: redefinition of tokclass or conflict
- w/token 'Literal'; ignored
-
- meant that I had used the #tokclass "Literal" before it was defined.
- Only rules, not tokens, can be used in forward references. The problem
- was fixed by moving the #tokclass statement up to the #token section of
- the file.
-
- 40. The char * variables zzbegexpr and zzendexpr point to the start and end
- of the string last matched by a regular expression in a #token statement.
-
- However, the char array pointed to by zzlextext may be larger than the
- string pointed to by zzbegexpr and zzendexpr because it includes substrings
- accumulated through the use of zzmore().
-
-
- 41. The preprocessor symbol ZZCOL in the lexical scanner controls the
- update of column information. This doesn't cause the zzsyn() routine to
- report the position of tokens causing the error. You'll still have to
- write that yourself. The problem, I think, is that, due to look-ahead,
- the value of zzendcol will not be synchronized with the token causing the
- error, so that the problem becomes non-trivial.
-
- 42. If you want to use ZZCOL to keep track of the column position
- remember to adjust zzendcol in the lexical action when a character is not
- one print position wide (e.g. tabs or non-printing characters).
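- A sketch of the adjustment, assuming 1-based columns and tab stops every
- 8 positions (both are assumptions of this example, not anything pccts
- mandates):
-
```c
#include <assert.h>

/* How a zzendcol-style counter might advance for characters that are
 * not one print position wide. */
static int advance_col(int col, char c)
{
    if (c == '\t')
        return ((col - 1) / 8 + 1) * 8 + 1;  /* jump to the next tab stop  */
    if (c == '\n')
        return 1;                            /* new line: back to column 1 */
    return col + 1;                          /* ordinary printing character */
}
```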
-
- 43. In version 1.00 it was common to change the token code based on
- semantic routines in the #token actions. With the addition of semantic
- predicates in 1.06 this technique is now frowned upon.
-
- Old style:
-
- #token TypedefName
- #token ID "[a-z A-Z]*"
- <<{if (isTypedefName(LATEXT(1))) NLA=TypedefName;};>>
-
- New Style:
-
- #token ID "[a-z A-Z]*"
-
- typedefName : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1>>? ID;
-
- See the section on semantic predicates for a longer explanation.
-
-
- 44. DLG has no operator like grep's "^" which anchors a pattern to the
- beginning of a line. One can use tests based on zzbegcol only if column
- information is selected (#define ZZCOL) AND one is NOT using infinite
- lookahead mode (syntactic predicates). A technique which does not depend
- on zzbegcol is to look for the newline character and then enter a special
- #lexclass.
-
- Consider the problem of recognizing lines which have a "!" as the first
- character of a line. A possible solution suggested by Doug Cuthbertson
- is:
-
- #token "\n" <<zzline++; zzmode(BEGIN_LINE);>>
-
- *** or ***
-
- #token "\n" <<zzline++;
- if (zzchar=='!') zzmode(BEGIN_LINE);>>
-
- #lexclass BEGIN_LINE
- #token BANG "!" <<zzmode(START);>>
- #token "~[]" <<zzmode(START); zzmore();>>
-
- When a newline is encountered the #lexclass BEGIN_LINE is entered. If
- the next character is a "!" it returns the token "BANG" and returns
- to #lexclass START. If the next character is anything else it calls
- zzmore to accumulate additional characters for the token and, as before,
- returns to #lexclass START. (The order of calls to zzmode() and zzmore()
- is not significant).
-
- There are two limitations to this.
-
- a. If there are other single character tokens which can appear in the first
- column then using zzmore() won't be sufficient to work around the problem
- because the entire (one character) token has already been consumed. Thus
- all single character tokens which can appear in column 1 must appear in
- both #lexclass START and #lexclass BEGIN_LINE.
-
- b. The first character of the first line is not preceded by a newline,
- so DLG will start in the wrong state. You might therefore want to rename
- "BEGIN_LINE" to "START" and "START" to "NORMAL".
-
- Another solution is to use ANTLRf (input from a function) to insert
- your own function to do the kind of lexical processing which is difficult
- to express in DLG.
-
- In 1.20 the macro ANTLRm was added. It is similar to ANTLR, but has an
- extra argument which allows one to specify the lexical class which is
- passed to zzmode() to specify the initial #lexclass state of DLG.
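- The two-#lexclass idea can be mimicked in a few lines of plain C. The
- scanner below starts in BEGIN_LINE (side-stepping limitation "b"), drops
- back to START after one character, and re-enters BEGIN_LINE after each
- newline; here it merely counts the BANG tokens:
-
```c
#include <assert.h>

enum { M_START, M_BEGIN_LINE };

/* '!' is a BANG token only while in BEGIN_LINE mode, i.e. only when it
 * is the first character of a line. */
static int count_bang_lines(const char *s)
{
    int mode = M_BEGIN_LINE;           /* input begins at the start of a line */
    int bangs = 0;
    for (; *s; s++) {
        if (mode == M_BEGIN_LINE && *s == '!')
            bangs++;                   /* BANG: '!' in column 1 only     */
        mode = (*s == '\n') ? M_BEGIN_LINE   /* like zzmode(BEGIN_LINE)  */
                            : M_START;       /* like zzmode(START)       */
    }
    return bangs;
}
```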
-
- 45. In version 1.10 there were problems using 8 bit characters with DLG.
- Versions >= 1.20 of ANTLR/DLG work with 8 bit character sets when they are
- compiled in a mode in which char variables are by default unsigned (the
- g++ option "-funsigned-char"). This should be combined with a call to
- setlocale (LC_ALL,"") to replace the default locale of "C" with the user's
- native locale. This is system dependent - it works with Unix systems but
- not DOS. Contributed by Ulfar Erlingsson (ulfarerl@rhi.hi.is).
-
- See Example 4 below.
-
- ===============================================================================
- Section on #lexclass
- -------------------------------------------------------------------------------
- 46. Special care should be taken when using "in-line" regular expressions
- in rules if there are multiple lexical classes (#lexclass). ANTLR will
- place such regular expressions in the last lexical class defined. If
- the last lexical class was not START you may be surprised.
-
- #lexclass START
- ....
- #lexclass COMMENT
- ....
-
- inline_example: symbol "=" expression
-
- This will place "=" in the #lexclass COMMENT (where
- it will never be matched) rather than the START #lexclass
- where the user meant it to be.
-
- Since it is okay to specify parts of the #lexclass in several pieces
- it might be a good idea when using #lexclass to place "#lexclass START"
- just before the first rule - then any in-line definitions of tokens
- will be placed in the START #lexclass automatically.
-
- #lexclass START
- ...
- #lexclass A
- ...
- #lexclass B
- ...
- #lexclass START
-
- 47. Good examples of the use of #lexclass are the definitions for C
- and C++ style comments, character literals, and string literals which
- can be found in pccts/lang/C/decl.g - or see Example 1 below.
-
- 48. The initial #lexclass of DLG is set by a data statement to START
- (which is 0). Unlike ANTLRm, the traditional ANTLR macros (ANTLRf, ANTLRs,
- and ANTLR) do NOT reset the #lexclass. If you call ANTLR multiple times
- during a program (for instance to parse each statement of a line-oriented
- language independently) DLG will resume in the #lexclass that it was in
- when ANTLR returned. If you want to restart DLG in the START state you
- should precede the call to ANTLR with
-
- zzmode(START);
- or use:
- ANTLRm (myStartRule(),myStartMode);
-
-
- 49. Consider the problem of a grammar in which a statement is composed
- of clauses, each of which has its own #lexclass and in which a given
- word is "reserved" in some clauses and not others:
-
- #1;1-JAN-94 01:23:34;enable;a b c d;this is a comment;
- #2;1-JAN-94 08:01:56;operator;smith;move to another station;
- #3;1-JAN-94 09:10:11;move;old pos=5.0 new pos=6.0;operator request;
- #4;1-JAN-94 10:11:12;set-alarm;beeper;2-JAN-94 00:00:01;
-
- One would like to reuse a #lexclass if possible. There is no problem with
- maintaining a stack of modes (#lexclass numbers) and pushing a new mode
- onto the stack each time a new #lexclass subroutine is called. How to do
- this is demonstrated in Example 5. The problem appears when it is
- necessary to leave a #lexclass and return more than one level. To be more
- specific, a #token action can only be executed when one or more characters
- are consumed - so to return through three levels of #lexclass calls would
- appear to require the consumption of at least three characters. In the
- case of balanced constructs like "...", and '...', or (...) this is not a
- problem since the terminating character can be used to trigger the #token
- action. However, if the scan is terminated by a "separator", such as the
- semi-colon above (";"), one cannot use the same technique. Once the
- semi-colon is consumed it is unavailable for the other #lexclass routines
- on the stack to see.
-
- My solution is to allow the user to specify (during the call to pushMode)
- a "lookahead" routine to be called when the corresponding element of the
- mode stack is popped. At that point the "lookahead" routine can examine
- zzchar to determine whether it also wants to pop the stack, and so on up
- the mode stack. The consumption of a single character can result in
- popping multiple modes from the mode stack based on a single character of
- lookahead. See the second part of Example 5 below.
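- That mode stack can be sketched in plain C: each pushMode() may record a
- predicate which, when its stack element is popped, examines the lookahead
- character (zzchar in real code) and decides whether the stack should pop
- again, cascading upward. The field modes and pop_on_semi() below are
- hypothetical:
-
```c
#include <assert.h>
#include <stddef.h>

enum { START, DATE_FIELD, NAME_FIELD, COMMENT_FIELD };  /* hypothetical */

typedef int (*wants_pop_fn)(int lookahead);

static int cur_mode = START;
static struct { int mode; wants_pop_fn check; } stk[32];
static int sp = 0;

static int pop_on_semi(int c) { return c == ';'; }

static void pushMode(int m, wants_pop_fn check)
{
    stk[sp].mode = cur_mode;          /* mode to restore on pop            */
    stk[sp].check = check;            /* consulted when this element pops  */
    sp++;
    cur_mode = m;
}

static void popMode(int lookahead)    /* one char can unwind several modes */
{
    for (;;) {
        sp--;
        cur_mode = stk[sp].mode;
        if (sp == 0 || stk[sp].check == NULL || !stk[sp].check(lookahead))
            break;
    }
}
```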
-
- Continuing with the example of the log file (above): each statement type
- has its fields in a specific order. When the statement type is recognized
- a pointer is set to a list of the #lexclasses which is in the same order as
- the remaining fields of that kind of statement. An action attached to
- every #token which recognizes a semi-colon (";") advances a pointer in
- the list of #lexclasses and then changes the #lexclass by calling zzmode()
- to set the #lexclass for the next field of the statement.
-
- ===============================================================================
- Section on rules
- -------------------------------------------------------------------------------
- 50. If a rule is not used (is an orphan) it can lead to unanticipated
- reports of ambiguity. Use the ANTLR cross-reference option (-cr) to
- locate rules which are not referenced. Not verified in version 1.20.
-
- 51. ANTLR attempts to deduce "start" rules by looking for rules which
- are not referenced by any other rules. When it finds such a rule it
- assumes that an EOF token ("@") should be there and adds one if the
- user did not code one. This is the only case, according to TJP, when
- ANTLR adds something to the user's grammar.
-
- 52. To express the idea "any single token is acceptable at this point"
- use the "." token wild-card. This can be very useful for providing a
- context dependent error message, rather than the all purpose message
- "syntax error".
-
- if-stmt : IF "\(" expr "\)" stmt
-         | IF . <<printf("If statement requires expression "
-                         "enclosed in parentheses\n");
-                  PARSE_FAIL;
-                >>
-
- It is probably best not to use expressions such as:
-
- ignore: (.)* /* Not a good idea */
-
- which will gobble up everything to the end-of-file.
-
- 53. New to version 1.20 is the "~" operator for tokens. It allows
- one to specify tokens which must NOT match in order to match a rule.
-
- The "~" operator cannot be applied to rules. To express the idea
- "if this rule doesn't match try to match this other rule" use
- syntactic predicates.
-
- 54. Some constructs which are bound to cause warnings about
- ambiguities:
-
- rule : a { ( b | c )* };
-
- rule : a { b };
- b : ( c )*;
-
- rule : a c*;
- a : b { c };
-
- rule : a { b | c | };
-
-
- 55. Don't confuse init-actions with actions which precede a rule. If the
- first element following the start of a rule or sub-rule is an action it is
- always interpreted as an init-action.
-
- An init-action occurs in a scope which includes the entire rule or sub-rule.
- An action which is NOT an init-action is enclosed in "{" and "}" during
- generation of code for the rule and has essentially zero scope - the
- action itself.
-
- The difference between an init-action and an action which precedes a rule
- can be especially confusing when an action appears at the start of an
- alternative:
-
- These APPEAR to be almost identical, but they aren't:
-
- b : <<int i=0;>> b1 > [i] /* b1 <<...>> is an init-action */
- | <<int j=0;>> b2 > [j] /* b2 <<...>> is part of the rule */
- ; /* and will cause a compilation error */
-
- On line "b1" the <<...>> appears immediately after the beginning of the
- rule making it an init-action. On line "b2" the <<...>> does NOT appear at
- the start of a rule or sub-rule, thus it is interpreted as an action which
- happens to precede the rule.
-
- This can be especially dangerous if you are in the habit of rearranging
- the order of alternatives in a rule. For instance:
-
- Changing this:
-
- b : <<int i=0,j=0;>> <<i++;>> b1 > [i] /* c1 */
- | <<j++;>> b1 > [i] /* c2 */
- ;
-
- to:
-
- b : /* empty production */ /* d1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* d2 */
- | <<j++;>> b1 > [i]
- ;
-
- or to this:
-
- b
- : <<j++;>> b1 > [i] /* e1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* e2 */
-
- changes an init-action into a non-init action, and vice-versa.
-
-
- 56. A particularly nasty form of the init-action problem is when
- an empty sub-rule has an associated action:
-
- rule!: ID (
- /* empty */
- <<#0=#[ID,$1.1];>>
- | array_bounds
- <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- Since there is no reserved word in pccts for epsilon, the action
- for the empty arm of the sub-rule becomes the init-action. For
- this reason it's wise to follow one of two conventions:
- (1) represent epsilon with an empty sub-rule "()", or (2) put the
- empty alternative last in the list of alternatives:
-
- rule!: ID (
- () <<#0=#[ID,$1.1];>>
- | array_bounds
- <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- The cost of using "()" to represent epsilon is the execution of the macro
- zzBLOCK() at the start of the sub-rule and zzEXIT() at the end of the
- sub-rule. Macro zzBLOCK() creates a temporary stack pointer for the
- attribute stack and checks for overflow. Macro zzEXIT() pops any
- attributes that might have been placed on attribute stack. Since no
- attribute stack operations take place for epsilon this is wasted CPU
- cycles, however this is probably not a significant cost for many users.
-
- 57. Another form of problem caused by init-action occurs when one
- comments out a rule in the grammar in order to test an idea:
-
- rule /* a1 */
- : <<init-action;>> /* a2 */
- //// rule_a /* a3 */
- | rule_b /* a4 */
- | rule_c /* a5 */
-
- In this case one only wanted to comment out the "rule_a" reference
- in line "a3". The reference is indeed gone, but the change has
- introduced an epsilon production - which probably creates a large
- number of ambiguities. Without the init-action the ":" would have
- probably have been commented out also, and ANTLR would report a
- syntax error - thus preventing one from shooting oneself in the foot.
-
- 58. In the case of sub-rules such as (...)+, (...)*, and {...} the
- init-action is executed just once before the sub-rule is entered.
- Consider the following example from section 3.6.1 (page 29) of the 1.00
- manual:
-
- a : <<List *p=NULL;>> // initialize list
- Type
- ( <<int i=0;>> // initialize index
- Var <<append(p,i++,$1);>>
- )*
- <<OperateOn(p);>>
- ;
-
-
- 59. Associativity and precedence of operations is determined by
- nesting of rules. In the example below "=" associates to the right
- and has the lowest precedence. Operators "+" and "*" associate to
- the left with "*" having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- See Example 2.
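- The nesting scheme can be hand-transcribed as a tiny evaluator over
- single-digit operands (a sketch of the kind of recursive-descent code
- ANTLR generates, with "=" left out so a value can be computed). Each
- rule loops over its own operator; "*" binds tighter than "+" simply
- because expr2 sits one level deeper in the call chain:
-
```c
#include <assert.h>

static const char *p;                         /* expression cursor */

static int expr3(void) { return *p++ - '0'; } /* expr3 : ID        */

static int expr2(void)                        /* expr2 : expr3 ("\*" expr3)* */
{
    int v = expr3();
    while (*p == '*') { p++; v *= expr3(); }  /* loop => left associative */
    return v;
}

static int expr1(void)                        /* expr1 : expr2 ("\+" expr2)* */
{
    int v = expr2();
    while (*p == '+') { p++; v += expr2(); }
    return v;
}

static int eval(const char *s) { p = s; return expr1(); }
```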
-
- 60. Fail actions for a rule can be placed after the final ";" of
- a rule. These will be:
-
- "executed after a syntax error is detected but before
- a message is printed and the attributes have been destroyed.
- However, attributes are not valid here because one does not
- know at what point the error occurred and which attributes
- even exist. Fail actions are often useful for cleaning up
- data structures or freeing memory."
-
- (Page 29 of 1.00 manual)
-
- Example of a fail action:
-
- a : <<List *p=NULL;>>
- ( Var <<append(p,$1);>> )+
- <<operateOn(p);rmlist(p);>>
- ; <<rmlist(p);>>
- ************** <--- Fail Action
-
- 61. When you have rules with large amounts of lookahead (that may
- cross several lines) you can use the ANTLR -gk option to make an
- ANTLR-generated parser delay lookahead fetches until absolutely
- necessary. To get better line number information (e.g. for error
- messages or #line directives) place an action which will save
- "zzline" in a variable at the start of the production where you
- want better line number information:
-
- a : <<int saveCurrentLine;>>
- <<saveCurrentLine = zzline;>> A B C
- << /* use saveCurrentLine not zzline here */ >>
- | <<saveCurrentLine = zzline;>> A B D
- << /* use saveCurrentLine not zzline here */ >>
- ;
-
- After the production has been matched you can use saveCurrentLine
- rather than the bogus "zzline".
-
- Contributed by Terence "The ANTLR Guy" Parr (parrt@acm.org)
-
- In version 1.20 a new macro, ZZINF_LINE(), was added to extract line
- information in a manner similar to LATEXT when using infinite lookahead
- mode. See the page 6 of the 1.20 release notes for more information.
- There is nothing like ZZINF_COL() for column information, but it should
- be easy to create using ZZINF_LINE() as a model.
-
- 62. An easy way to get a list of the names of all the rules is
- to grep tokens.h for the string "void" or edit the output from ANTLR
- run with the -cr option (cross-reference).
-
-
- 63. It took me a while to understand in an intuitive way the difference
- between full LL(k) lookahead given by the ANTLR -k switch and the
- linear approximation given by the ANTLR -ck switch. This was in spite
- of the example given in section 5 (pages 18 to 21) of the 1.10 release notes.
-
- Most of the time I run ANTLR with -k 1 and -ck 2. Because I didn't
- understand the linear approximation I didn't understand the warnings about
- ambiguity. I couldn't understand why ANTLR would complain about something
- which I thought was obviously parse-able with the lookahead available.
- Was it a bug or was it me ? I would try to make the messages go away
- totally, which was sometimes very hard. If I had understood the linear
- approximation I might have been able to fix them easily or at least have
- realized that there was no problem with the grammar, just with the
- limitations of the linear approximation.
-
- I will restrict the discussion to the case of "-k 1" and "-ck 2".
-
- Consider the following example:
-
- rule1 : rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- It should be clear that with the sentence being only two tokens this
- should be parseable with LL(2).
-
- Instead, because k=1 and ck=2 ANTLR will produce the following messages:
-
- /pccts120/bin/antlr -k 1 -gs -ck 2 -gh example.g
- Antlr parser generator Version 1.20 1989-1994
- example.g, line 23: warning: alts 1 and 2 of the rule itself
- ambiguous upon { B }, { X Z }
- example.g, line 23: warning: alts 1 and 3 of the rule itself
- ambiguous upon { C }, { X }
-
- The code generated resembles the following:
-
- if (LA(1)==A || LA(1)==B || LA(1)==C) &&
- (LA(2)==X || LA(2)==Y || LA(2)==Z) then rule2a()
-
- else if (LA(1)==B) &&
- (LA(2)==X || LA(2)==Z) then rule2b()
-
- else if (LA(1)==C) &&
- (LA(2)==X) then rule2c()
- ...
-
- This might be called "product-of-sums". There is an "or" part for
- LA(1), an "or" part for LA(2), and they are combined using "and".
- To match, the first lookahead token must be in the first set and the second
- lookahead token must be in the second set. It doesn't matter that what
- one really wants is:
-
- if (LA(1)==A && LA(2)==X) ||
- (LA(1)==B && LA(2)==Y) ||
- (LA(1)==C && LA(2)==Z) then rule2a()
-
- else if (LA(1)==B && LA(2)==X) ||
- (LA(1)==B && LA(2)==Z) then rule2b()
-
- else if (LA(1)==C && LA(2)==X) then rule2c()
-
- This form is "sum-of-products", but the real problem is that each
- product involves one element from LA(1) and one from LA(2) and as the
- number of possible tokens increases the number of terms grows as N**2.
- With the linear approximation the number of terms grows (surprise)
- linearly in the number of tokens.
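-
- A small C sketch makes the difference concrete (the token names and
- helper functions are invented for this illustration, not pccts code):
- the approximate test accepts the pair (B,X) for rule2a even though no
- alternative of rule2a matches it.

```c
#include <assert.h>

enum Tok { A, B, C, X, Y, Z };

/* linear approximation: one set test per lookahead depth
   ("product of sums") for rule2a : A X | B Y | C Z ; */
static int approx_rule2a(enum Tok la1, enum Tok la2)
{
    return (la1 == A || la1 == B || la1 == C) &&
           (la2 == X || la2 == Y || la2 == Z);
}

/* full LL(2): one term per valid token pair ("sum of products") */
static int exact_rule2a(enum Tok la1, enum Tok la2)
{
    return (la1 == A && la2 == X) ||
           (la1 == B && la2 == Y) ||
           (la1 == C && la2 == Z);
}
```

- Every pair accepted by exact_rule2a() is also accepted by
- approx_rule2a(), but not vice versa - the approximation never selects
- a subset of the correct lookahead.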
-
- ANTLR won't do this with k=1; it will only do "product-of-sums". However,
- all is not lost - you simply add a few well chosen semantic predicates
- which you have computed using your LL(k>1), all purpose, carbon based,
- analog computer.
-
- The linear approximation selects for each branch of the "if" a set which
- MAY include more than what is wanted. It never selects a subset of the
- correct lookahead sets! We simply insert a hand-coded version of the
- LL(2) computation. It's ugly, especially in this case, but it fixes the
- problem. In large grammars it may not be possible to run ANTLR with k=2,
- so this fixes a few rules which cause problems. The generated parser may
- run faster because it will have to evaluate fewer terms at execution time.
-
- <<
- int bypass_rule2a() {
- if ( LA(1)==B && LA(2)==Y ) return 0;
- if ( LA(1)==B ) return 1;
- if ( LA(1)==C && LA(2)==X ) return 1;
- return 0;
- }
- >>
-
- rule1 :
- <<!bypass_rule2a()>>? rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- The real cases I've coded have shorter code sequences in the semantic
- predicate. I coded this as a function to make it easier to read and
- because there is a bug in 1.1x and 1.2x which prevents semantic predicates
- from crossing lines. Another reason to use a function (or macro) is to
- make it easier to read the generated code to determine when your semantic
- predicate is being hoisted too high (it's easy to find references to a
- function name with the editor - but difficult to locate a particular
- sequence of "LA(1)" and "LA(2)" tests). Predicate hoisting is a separate
- issue which is described elsewhere in this note.
-
- In some cases of reported ambiguity it is not necessary to add semantic
- predicates because no VALID token sequence could get to the wrong rule.
- If the token sequence were invalid it would be detected by the grammar
- eventually, although perhaps not where one might wish. In other cases
- the only necessary action is a reordering of the ambiguous rules so
- that a more specific rule is tested first. The error messages still
- appear, but one can ignore them or place a trivial semantic predicate
- (i.e. <<1>>? ) in front of the later rules. This makes ANTLR happy
- because it thinks you've added a semantic predicate which fixes things.
-
- Some constructs just invite problems. For instance in C++ with a suitable
- definition of the class "C" one can write:
-
- C a,b,c; /* a1 */
- a.func1(b); /* a2 */
- a.func2()=c; /* a3 */
- a = b; /* a4 */
- a.operator =(b); /* a5 */
-
- Statement a5 happens to place an "=" (or any of the usual C++ operators)
- in a token position where it can cause a lot of ambiguity in the lookahead
- set. I eventually solved this particular problem by creating a special
- #lexclass for things which follow "operator". I use an entirely different
- token number for such operators - thereby avoiding the whole problem.
-
- //
- // C++ operator sequences
- //
- // operator <type_name>
- // operator <special characters>
- //
- // There must be at least one non-alphanumeric character between
- // "operator" and operator name - otherwise they would be run
- // together - ("operatorint" instead of "operator int")
- //
-
- #lexclass LEX_OPERATOR
- #token FILLER_C1 "[\ \t]*"
- <<zzskip();
- if( isalnum(zzchar) ) zzmode(START);
- >>
- #token OPERATOR_STRING "[\+\-\*\/\%\^\&\|\~\!\=\<\>]*"
- <<zzmode(START);>>
- #token FILLER_C2 "\(\) | \[\] "
- <<NLA=OPERATOR_STRING;zzmode(START);>>
-
- ===============================================================================
- Section on Attributes
- -------------------------------------------------------------------------------
- 64. Attributes are built automatically only for terminals. For
- rules (non-terminals) one must assign an attribute to $0, use the
- $[token,...] convention for creating attributes, or use zzcr_attr().
-
- 65. The way to access the text (or whatever) part of an attribute
- depends on the way the attribute is stored.
-
- If one uses the pccts supplied routine "pccts/h/charbuf.h" then
-
- id : "[a-z]+" <<printf("Token is %s\n",$1.text);>>
-
- If one uses the pccts supplied routine "pccts/h/charptr.c" and
- "pccts/h/charptr.h" then:
-
- id : "[a-z]+" <<printf("Token is %s\n",$1);>>
-
- If one uses the pccts supplied routine "pccts/h/int.h" (which
- stores numbers only) then:
-
- number : "[0-9]+" <<printf ("Token is %d\n",$1);>>
-
- Note the use of %d rather than %s in the printf() format.
-
- 66. The expression $$ refers to the attribute of the named rule.
- The expression $0 refers to the attribute of the enclosing rule,
- (which might be a sub-rule).
-
- rule : a b (c d (e f g) h) i
-
- For (e f g) $0 becomes $3 of (c d ... h). For (c d ... h) $0 becomes
- $3 of (a b ... i). However $$ always is equivalent to $rule.
-
- 67. If you define a zzcr_attr() or zzmk_attr() which allocates resources
- such as strings from the heap don't forget to define a zzd_attr() routine
- to release the resources when the attribute is deleted.
-
- 68. Attributes go out of scope when the rule or sub-rule that defines
- them is exited. Don't try to pass them to an outer rule or a sibling
- rule. The only exception is $0 which may be passed back to the containing
- rule as a return argument. However, if the attribute contains a pointer
- which is copied (e.g. pccts/h/charptr.c) then extra caution is required
- because of the actions of zzd_attr(). For C++ users this should be
- implemented in the class copy constructor. The version of pccts/h/charptr.*
- distributed with pccts does not use C++ features. See the next item for
- more information.
-
- 69. The pccts/h/charptr.c routines use a pointer to a string. The string
- itself will go out of scope when the rule or sub-rule is exited. Why ?
- The string is copied to the heap when ANTLR calls the routine zzcr_attr()
- supplied by charptr.c - however ANTLR also calls the charptr.c supplied
- routine zzd_attr() (which frees the allocated string) as soon as the rule or
- sub-rule exits. The result is that in order to pass charptr.c strings to
- outer rules (for instance to $0) it is necessary to make an independent
- copy of the string using strdup or else zero the pointer to prevent its
- deallocation.
-
-
- 70. To initialize $0 of a sub-rule use a construct like the following:
-
- decl : typeID
- Var <<$2.type = $1;>>
- ( "," Var <<$2.type = $0;>>)*[$1]
- **** <--------------
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
-
- 71. One can use the zzdef0() macro to define a standard method for
- initializing $0 of a rule or sub-rule. If the macro is defined it is
- invoked as zzdef0(&($0)).
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
-
- I believe that for C++ users this would be handled by the class constructor.
-
- 72. If you construct temporary attributes in the middle of the
- recognition of a rule, remember to deallocate the structure should the
- rule fail. The code for failure goes after the ";" and before the next
- rule. For this reason it is sometimes desirable to defer some processing
- until the rule is recognized rather than the most convenient place.
-
- #include "pccts/h/charptr.h"
-
- ;statement!
- : <<char *label=0;>>
- {ID COLON <<label=MYstrdup($1);>> }
- statement_without_label
- <<#0=#(#[T_statement,label],#2);
- if (label!=0) free(label);
- // AST #1 is undefined
- // AST #2 is returned by
- // statement_without_label
- >>
- ;<<if (label !=0) free(label);>>
-
- In the above example attributes are handled by charptr.*. Readers of this
- note have been warned earlier about its dangers. The routine I have
- written to construct ASTs from attributes (invoked by #[int,char *]) knows
- about this behavior and automatically makes a copy of the character string
- when it constructs the AST. This makes the copy created by the explicit
- call to MYstrdup redundant once the AST has been constructed. If the call
- to "statement_without_label" fails then the temporary copy must be
- deallocated.
-
- ===============================================================================
- Section on ASTs
- -------------------------------------------------------------------------------
- 73. If you define a zzcr_ast() or zzmk_ast() which allocates resources
- such as strings from the heap don't forget to define a zzd_ast() routine
- to release the resources when the AST is deleted. For C++ users this
- should be implemented as part of the class destructor.
-
- 74. If you construct temporary ASTs in the middle of the recognition of a
- rule, remember to deallocate the structure should the rule fail. The code
- for failure goes after the ";" and before the next rule. For this reason
- it is sometimes desirable to defer some processing until the rule is
- recognized rather than the most appropriate place. For C++ users this
- might be implemented as part of the class destructor.
-
- If the temporary is an AST returned by a called rule then you'll probably
- have to call zzfree_ast() to release the entire AST tree. Consider
- the following example:
-
- obj_name! /* a1 */
- : <<AST *node=0;>> /* a2 */
- class_name <<node=#1;>> /* a3 */
- ( /* a4 */
- () /* empty */ /* a5 */
- <<#0=node;node=0;>> /* a6 */
- | COLON_COLON follows_dot_class[node] /* a7 */
- <<#0=#2;node=0;>> /* a8 */
- ) /* a9 */
- ......... /* a10 */
- /* a11 */
- ; <<if (node!=0) zzfree_ast(node);>> /* a12 */
-
- In this case "class_name" may return a full AST tree (not a trivial tree)
- because of information required to represent template classes (e.g.
- dictionary<int,1000> is a "class_name"). This tree ("node") is passed to
- another rule ("follows_dot_class") which uses it to construct another AST
- tree which incorporates it. If "follows_dot_class" succeeds then node is
- set to 0 (lines a6 or a8) because the tree is now referenced via #2. If
- "follows_dot_class" fails then the entire tree created by class_name must
- be deallocated (line a12). The temporary "node" must be used because there
- is no convenient way (such as #1.1) to refer to class_name from within the
- sub-rule.
-
- Please note the use of an empty sub-rule ("()" on line a5) to avoid the nasty
- init-action problem mentioned earlier.
-
- 75. Example 6 shows debugging code to help locate ASTs that were created
- but never deleted.
-
- 76. If you want to place prototypes for routines that have an AST
- as an argument in the #header directive you should explicitly
- #include "ast.h" after the #define AST_FIELDS and before any references
- to AST:
-
- #define AST_FIELDS int token;char *text;
- #include "ast.h"
- #define zzcr_ast(ast,attr,tok,astText) \
- create_ast(ast,attr,tok,text)
- void create_ast (AST *ast,Attr *attr,int tok,char *text);
-
-
- 77. The make-a-root operator for ASTs ("^") can be applied only to
- terminals. (This includes items identified in #token ,#tokclass, and
- #tokdef statements). I think this is because a child rule might return a
- tree rather than a single AST. If it did then it could not be made into a
- root as it is already a root and the corresponding fields of the structure
- are already in use. To make an AST returned by a called rule a root use
- the expression: #(root-rule sibling1 sibling2 sibling3).
-
- add ! : expr ("\+"^ expr) ; // Is ok
-
- addOperator ! : expr (addOp^ expr) ; // Is NOT ok
- addOp : "\+" | "\-"; //
-
- Example 2 describes a workaround for this restriction.
-
- 78. Because it is not possible to use an already constructed AST tree
- as the root of a new tree (unless it's a trivial tree with no children)
- one should be suspicious of any constructs like the following:
-
- rule! : ........ <<#0=#(#1,...)...;>>
- ** <=====================
-
- If #1 is a non-trivial tree its existing children will be lost when the
- new tree is constructed for assignment to #0.
-
- 79. Do not assign to #0 of a rule unless automatic construction of ASTs
- has been disabled using the "!" operator:
-
- a! : x y z <<#0=#(#1 #2 #3);>> // ok
- a : x y z <<#0=#(#1 #2 #3);>> // NOT ok
-
- The reason for the restriction is that assignment to #0 will cause any
- ASTs pointed to by #0 to be lost when the pointer is overwritten.
-
- The stated restriction is somewhat stronger than necessary. You can
- assign to #0 even when using automated AST construction, if the old
- tree pointed to by #0 is part of the new tree constructed by #(...).
- For example:
-
- #token COMMA ","
- #token STMT_LIST
-
- stmt_list: stmt (COMMA stmt)* <<#0=#(#[STMT_LIST],#0);>>
-
- The automatically constructed tree pointed to by #0 is just put at the
- end of the new list, so nothing is lost.
-
- If you reassign to #0 in the middle of the rule, automatic tree
- construction will result in the addition of remaining elements at the end
- of the new tree. This is not recommended by TJP.
-
- Special care must be used when combining the make-as-root operator
- (e.g. rule: expr OP^ expr) with this transgression (assignment to #0 when
- automatic tree construction is selected).
-
-
- 80. Even when automatic construction of ASTs is turned off in a rule the
- called rules still return the ASTs that they constructed. The same applies
- when the "!" operator is applied to a called rule. This is hard to
- believe when one sees a rule like the following:
-
- rule: a! b! c!
-
- generate (in part) a sequence of operations like:
-
- _ast = NULL; a(&_ast);
- _ast = NULL; b(&_ast);
- _ast = NULL; c(&_ast);
-
- It appears that the AST pointer is being assigned to a temporary where it
- becomes inaccessible. This is not the case at all. The called rule is
- responsible for placing a pointer to the AST which is constructed onto a
- stack of AST pointers. The stack of AST pointers is normally in global
- scope with ZZAST_STACKSIZE elements.
-
- (The "!" operator simply inhibits the automatic construction of the
- AST trees. It does not prevent the construction of the ASTs themselves.
- When calling a rule which constructs ASTs and not using the result one
- must destroy the constructed AST using zzfree_ast() in order to avoid a
- memory leak. See Example 6 below for code which aids in tracking lost
- ASTs).
-
- Consider the following examples (using the list notation of page 45 of
- the 1.00 manual):
-
- a: A;
- b: B;
- c: C;
-
- #token T_abc_node
-
- rule : a b c ; <<;>> /* AST list (0 A B C) without root */
- rule ! : a b c <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a! b! c! <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a^ b c /* AST tree (A B C) with root A */
- rule ! : a b c <<#0=#(#1,#2,#3);>> /* AST tree (A B C) with root A */
-
- rule ! : a b c <<#0=#(#[T_abc_node,0],#1,#2,#3);>>
- /* AST tree (T_abc_node A B C) */
- /* with root T_abc_node */
- rule : a b c <<#0=#(#[T_abc_node,0],#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(#[T_abc_node,0],#1,#2,#3);>> /* the same as above */
-
- rule ! : a b c <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
- rule : a b c <<#0=#(toAST(T_abc_node),#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
-
- The routine "toAST()" calls zzmk_ast() to construct an AST given the token number.
- For a typical version of zzmk_ast() it would look something like the following:
-
- AST * toAST (int tokenID) {
- return zzmk_ast (zzastnew(),tokenID,NULL);
- }
-
- I find toAST() more convenient than passing the extra arguments to zzmk_ast()
- using a construct like #[T_abc_node,0] or writing zzmk_ast() with varargs.
- Using varargs defeats most forms of inter-procedural type checking (unless you
- are using C++ which allows overloaded function names).
-
-
- 81. If you use ASTs you have to pass a root AST to ANTLR.
-
- AST *root=NULL;
- again:
- ANTLR (start(&root),stdin);
- walk_the_tree(root);
- zzfree_ast(root);
- root=NULL;
- goto again;
-
- 82. zzfree_ast(AST *tree) will recursively descend the AST tree and free
- all sub-trees. The user should supply a routine zzd_ast() to free any
- resources used by a single node - such as pointers to character strings
- allocated on the heap. See Example 2 on associativity and precedence.
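-
- The traversal can be sketched in a few lines of C (a child/sibling
- node and a counting del_node() stand in for the real AST and
- zzd_ast(); this shows the shape of the recursion, not the pccts
- source):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ast { struct ast *down, *right; } ASTNode;

static int freed;                       /* counts del_node() calls */
static void del_node(ASTNode *n) { (void)n; freed++; }

/* recursive descent in the spirit of zzfree_ast(): free the
   children, then the siblings, then the node itself */
static void free_tree(ASTNode *t)
{
    if (t == NULL) return;
    free_tree(t->down);
    free_tree(t->right);
    del_node(t);                        /* per-node cleanup hook */
    free(t);
}
```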
-
- 83. AST elements in rules are assigned numbers in the same fashion as
- attributes with three exceptions:
-
- 1. A hole is left in the sequence when sub-rules are encountered.
- (e.g. "(...)+", "(...)*", and "{...}").
- 2. #0 is the AST of the named rule, not the sub-rule - see the next item
- 3. There is nothing analogous to $i.j notation (which allows one
- to refer to attributes from earlier in the rule). In other words,
- you can't use #i.j notation to refer to an AST created earlier
- in the rule.
-
- There are plans to add notation similar to Sorcerer's
- tagged elements, but this will not appear for a while.
- (24-Aug-94). It is not part of version 1.22.
-
- Consider the following example:
-
- a : b // B is #1 for the rule
- (c d)* // C is #1 when scope is inside the sub-rule
- // D is #2 when scope is inside the sub-rule
- // You may *NOT* refer to b as #1.1
- e // E is #3 for the rule
- // There is NO #2 for the rule
-
- 84. The expression #0 refers to the AST of the named rule. Thus it is
- a misnomer and (for consistency) should probably have been named ## or #$.
- There is nothing equivalent to $0 for ASTs. This is probably because
- sub-rules aren't assigned AST numbers in a rule. See above.
-
-
- 85. Associativity and precedence of operations is determined by nesting
- of rules. In the example below "=" associates to the right and has the
- lowest precedence. Operators "+" and "*" associate to the left with "*"
- having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- In Example 2 the zzpre_ast() routine is used to walk all the AST nodes.
- The AST nodes are numbered during creation so that one can see the order in
- which they are created and the order in which they are deleted. Do not
- confuse the "#" in the sample output with the AST numbers used to refer to
- elements of a rule in the action part of the rule. The "#" in the
- sample output are just to make it simpler to match elements of the
- expression tree with the order in which zzd_ast() is called for each node in
- the tree.
-
- 86. If the make-as-root operator were NOT used in the rules:
-
- ;expr0 : expr1 {"=" expr0}
- ;expr1 : expr2 ("\+" expr2)*
- ;expr2 : expr3 ("\*" expr3)*
- ;expr3 : ID
-
- With input:
-
- a+b*c
-
- The output would be:
-
- a <#1> \+ <#2> b <#3> \* <#4> c <#5> NEWLINE <#6>
-
- zzd_ast called for <node #6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #4>
- zzd_ast called for <node #3>
- zzd_ast called for <node #2>
- zzd_ast called for <node #1>
-
-
- 87. Suppose that one wanted to replace the terminal "+" with the rule:
-
- addOp : "\+" | "-" ;
-
- Then one would be unable to use the "make-as-root" operator because it can
- be applied only to terminals.
-
- There are two workarounds. The #tokclass feature allows one to write:
-
- #tokclass AddOp { "\+" "\-"}
-
- A #tokclass identifier may be used in a rule wherever a simple #token
- identifier may be used.
-
- The other workaround is much more complicated:
-
- expr : (expr0 NEWLINE)
- ;expr0 : expr1 {"="^ expr0}
- ;expr1! : expr2 <<#0=#1;>>
- (addOp expr2 <<#0=#(#1,#0,#2);>> )*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- ;addOp : "\+" | "\-"
-
- With input:
-
- a-b-c
-
- The output is:
-
- ( \- <#4> ( \- <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
-
- The "!" for rule "expr1" disables automatic constructions of ASTs in the
- rule. This allows one to manipulate #0 manually. If the expression had
- no addition operator then the sub-rule "(addOp expr2)*" would not be
- executed and #0 will be assigned the AST constructed by rule expr2 (i.e.
- AST #1). However if there is an addOp present then each time the sub-rule
- is rescanned due to the "(...)*" the current tree in #0 is placed as the
- first of two siblings underneath a new tree. This new tree has the AST
- returned by addOp (AST #1 of the addOp sub-rule) as the root.
-
- 88. There is an option for doubly linked ASTs in the module ast.c. It is
- controlled by #define zzAST_DOUBLE. Even with zzAST_DOUBLE only the right
- and down fields are filled while the AST tree is constructed. Once the tree
- is constructed the user must call the routine zzdouble_link(tree,NULL,NULL) to
- traverse the tree and fill in the left and up fields. See page 12 of the
- 1.06 manual for more information.
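-
- What zzdouble_link(tree,NULL,NULL) has to do can be sketched as
- follows (the field and argument roles are assumed from the description
- above, not copied from ast.c):

```c
#include <assert.h>
#include <stddef.h>

typedef struct ast {
    struct ast *down, *right;     /* filled in during construction */
    struct ast *up, *left;        /* filled in by the second pass  */
} ASTNode;

/* walk the finished tree and fill in the back links */
static void double_link(ASTNode *t, ASTNode *left, ASTNode *up)
{
    if (t == NULL) return;
    t->left = left;
    t->up   = up;
    double_link(t->down, NULL, t);  /* first child: no left sibling */
    double_link(t->right, t, up);   /* sibling keeps the same parent */
}
```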
-
- 89. If a rule which creates an AST is called and the result is not
- linked into the tree being constructed then zzd_ast() will not be called
- to release the resources used by the rule. Prior to version 1.20
- this was especially important when rules were used in syntactic predicates.
- Versions >= 1.20 bypass construction of all ASTs during guess mode.
-
- ===============================================================================
- Section on Semantic Predicates
- -------------------------------------------------------------------------------
- 90. There is a bug in 1.1x and 1.2x which prevents semantic predicates
- from including string literals. The predicate is incorrectly
- "string-ized" in the call to zzfailed_predicate.
-
- rule: <<containsCharacter("!@#$%^&*",LATEXT(1))>>? ID
- /* Will not work */
-
- The workaround is to place the literal in a string constant and use
- the variable name.
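-
- For example (containsCharacter() is the hypothetical helper from the
- fragment above; one plausible definition is shown here only so the
- workaround compiles):

```c
#include <assert.h>
#include <string.h>

/* one plausible definition of the helper used in the predicate:
   true if the first character of text is in the given set */
static int containsCharacter(const char *set, const char *text)
{
    return text != NULL && text[0] != '\0' && strchr(set, text[0]) != NULL;
}

/* workaround: keep the literal in a named constant so the predicate
   text itself contains no string literal:
       <<containsCharacter(specialChars,LATEXT(1))>>? ID          */
static const char specialChars[] = "!@#$%^&*";
```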
-
- 91. There is a bug in 1.1x and 1.2x which prevents semantic predicates from
- crossing lines unless one uses an escaped newline.
-
- rule: <<do_test(\ /*** Note escaped newline ***/
- this_works_in_120)>>? x y z;
-
- 92. Semantic predicates are enclosed in "<<... >>?" but because they are
- inside "if" statements they normally do not end with a ";" - unlike other
- code enclosed in "<<...>>" in ANTLR.
-
- 93. Semantic predicates which are not the first element in the rule or
- sub-rule become "validation predicates" and are not used for prediction.
- After all, if there are no alternatives, then there is no need for
- prediction - and alternatives exist only at the left edge of rules
- and sub-rules. Even if a semantic predicate is on the left edge there
- is no guarantee that it will be part of the prediction expression.
- Consider the following two examples:
-
- a : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* a1 */
- | ID glob /* a2 */
- ;
- b : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* b1 */
- | NUMBER glob /* b2 */
- ;
-
- Rule a requires the semantic predicate to disambiguate alternatives
- a1 and a2 because the rules are otherwise identical. Rule b has a
- token type of NUMBER in alternative b2 so it can be distinguished from
- b1 without evaluation of the semantic predicate during prediction. In
- both cases the semantic predicate will also be evaluated inside the rule.
-
- When the tokens which can follow a rule allow ANTLR to disambiguate the
- expression without resort to semantic predicates ANTLR may not evaluate
- the semantic predicate in the prediction code. For example:
-
- simple_func : <<LA(1)==ID ? isSimpleFunc(LATEXT(1)) : 1>>? ID
- complex_func : <<LA(1)==ID ? isComplexFunc(LATEXT(1)) : 1>>? ID
-
- function_call : "(" ")"
-
- func : simple_func function_call
- | complex_func "." ID function_call
-
- In this case, a "simple_func" MUST be followed by a "(", and a
- "complex_func" MUST be followed by a ".", so it is unnecessary to evaluate
- the semantic predicates in order to predict which of the alternatives to
- use. A simple test of the lookahead tokens is sufficient. As stated
- before, the semantic predicates will still be used to validate the rule.
-
-
- 94. Suppose that the requirement that all semantic predicates used in
- prediction expressions must appear at the left-hand edge of a rule
- were lifted. Consider the following code segment:
-
- cast_expr /* a1 */
- : LP typedef RP cast_expr /* a2 */
- | expr13 /* a3 */
- ;expr13 /* a4 */
- : id_name /* a5 */
- | LP cast_expr RP /* a6 */
- ;typedef /* a7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* a8 */
- ;id_name /* a9 */
- : ID /* a10 */
-
- Now consider the token sequences:
-
- Token: #1 #2 #3 #4
- -- ----------------------- -- --
- "(" ID-which-is-typedef ")" ID
- "(" ID-which-is-NOT-typedef ")"
-
- Were the semantic predicate at line a8 hoisted to predict which alternative
- of cast_expr to use (a2 or a3) the program would use the wrong lookahead
- token (LA(1) and LATEXT(1)) rather than LA(2) and LATEXT(2) to check for an
- ID which satisfies "isTypedefName()". This is because it is preceded by a
- "(". This problem could perhaps be solved by application of sufficient
- ingenuity; in the meantime, the solution is to rewrite the rules so as
- to move the decision point to the left edge of the production.
-
- First perform in-line expansion of expr13 (line a3) in cast_expr:
-
- cast_expr /* b1 */
- : LP typedef RP cast_expr /* b2 */
- | id_name /* b3 */
- | LP cast_expr RP /* b4 */
-
- Secondly, move the alternatives (in cast_expr) beginning with LP to a
- separate rule so that "typedef" and "cast_expr" will be on the left edge:
-
- cast_expr /* c1 */
- : LP cast_expr_suffix /* c2 */
- | id_name /* c3 */
- ;cast_expr_suffix /* c4 */
- : typedef RP cast_expr /* c5 */
- | cast_expr RP /* c6 */
- ;typedef /* c7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* c8 */
- ;id_name /* c9 */
- : ID /* c10 */
-
- This will result in the desired treatment of the semantic predicate to
- choose from alternatives c5 and c6.
-
-
- 95. Validation predicates are evaluated by the parser. If they fail a
- call to zzfailed_predicate(string) is made. To disable the message
- redefine the macro zzfailed_predicate(string) or use the optional
- "failed predicate" action which is enclosed in "[" and "]" and follows
- immediately after the predicate:
-
- a : <<LA(1)==ID ?
- isTypedef(LATEXT(1)) : 1>>?[printf("Not a typedef\n");]
-
- 96. An expression in a semantic predicate (e.g. <<isFunc()>>? ) should not
- have side-effects. If there is no match then the rest of the rule using the
- semantic predicate won't be executed.
-
-
- 97. What is the "context" of a semantic predicate ? Answer due to TJP:
-
- The context of a predicate is the set of k-strings (comprised of lookahead
- symbols) that can be matched following the execution of a predicate. For
- example,
-
- a : <<p>>? alpha ;
-
- The context of "p" is LOOK(alpha) where LOOK(alpha) is the set of
- lookahead k-strings for alpha.
-
- Normally, one should compute the context for ANTLR (manually) because
- ANTLR is not smart enough to know the nature of your predicate and does not
- know how much context information is needed; it's conservative and tries
- to compute full LL(k) lookahead. Normally, you only need one token:
-
- class_name: <<isClass(LATEXT(1))>>? ID ;
-
- This example is incomplete, the predicate should really be:
-
- class_name: <<LA(1)==ID ? isClass(LATEXT(1)) : 1>>? ID ;
-
- This says, "I can tell you something if you have an ID, otherwise
- just assume that the rule is semantically valid." This only makes a
- difference if the predicate is *hoisted* out of the rule. Here is an
- example that won't work because it doesn't have context checks in the
- predicates:
-
- a : ( class_name | NUM )
- | type_name
- ;
-
- class_name : <<isClass(LATEXT(1))>>? ID ;
-
- type_name : <<isType(LATEXT(1))>>? ID ;
-
- The prediction for production one of rule "a" will be:
-
- if ( LA(1) in { ID, NUM } && isClass(LATEXT(1)) ) { ...
-
- Clearly, NUM will never satisfy isClass(), so the production will never
- match.
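-
- The effect can be modeled in plain C. The token codes and the
- isClass() stand-in are invented for the sketch (the real predicate
- inspects LATEXT(1), not the token code):

```c
enum { ID = 1, NUM = 2 };

/* Pretend symbol-table query: in real code this would inspect
 * LATEXT(1); here only an ID can name a class. */
static int isClass(int la1) { return la1 == ID; }

/* Prediction test as generated from the unguarded predicate:
 * isClass() is consulted even when the lookahead token is a NUM,
 * so the NUM alternative of production one can never be chosen. */
static int predict_unguarded(int la1)
{
    return (la1 == ID || la1 == NUM) && isClass(la1);
}

/* With the LA(1)==ID guard the predicate reports "valid" for
 * non-ID tokens, so a NUM can still select production one. */
static int predict_guarded(int la1)
{
    return (la1 == ID || la1 == NUM) && (la1 == ID ? isClass(la1) : 1);
}
```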
-
- When you ask ANTLR to compute context, it can check for missing predicates.
- With -prc on, for this grammar:
-
- a : b
- | <<isVar(LATEXT(1))>>? ID
- | <<isPositive(LATEXT(1))>>? NUM
- ;
-
- b : <<isType(LATEXT(1))>>? ID
- | NUM
- ;
-
- ANTLR reports:
-
- warning alt 1 of rule itself has no predicate to resolve
- ambiguity upon \{ NUM \}
-
- Page 34
-
- 98. A documented restriction of ANTLR is the inability to hoist multiple
- semantic predicates. However, no error message is given when one attempts
- this. When compiled with k=1 and ck=2 this generates inappropriate code
- in "statement" when attempting to predict "expr":
-
- #header <<
-
- #include "charbuf.h"
-
- int istypedefName (char *);
- int isCommand (char *);
-
- >>
-
- #token BARK
- #token GROWL
- #token ID
-
- statement
- : expr
- | declaration
- ;expr
- : commandName BARK
- | typedefName GROWL
- ;declaration
- : typedefName BARK
- ;typedefName
- : <<LA(1)==ID ? istypedefName(LATEXT(1)) : 1>>? ID
- ;commandName
- : <<LA(1)==ID ? isCommand(LATEXT(1)) : 1>>? ID
- ;
-
- The generated code resembles the following:
-
- void statement()
- {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK || LA(2)==GROWL) &&
- ( (LA(1)==ID ? isCommand(LATEXT(1)) : 1) ||
- (LA(1)==ID ? istypedefName(LATEXT(1)) : 1)) ) {
- expr();
- } else {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK) &&
-                  (LA(1)==ID ? istypedefName(LATEXT(1)) : 1) ) {
- declaration();
- } ...
-
- The problem is that "<typedefName> BARK" will be passed to expr() rather
- than declaration().
-
- Some help is obtained by using leading actions to inhibit hoisting as
- described in the next notes. However, omitting all semantic predicates
- in the prediction expression doesn't help if one requires them to predict
- the rule.
-
- Page 35
-
- 99. Leading actions will inhibit the hoisting of semantic predicates into
- the prediction of rules.
-
- expr_rhs
- : <<;>> <<>> expr0
- | command
-
- See the section on known bugs for a more complete example.
-
- 100. When using semantic predicates in ANTLR it is *IMPORTANT* to
- understand what the "-prc on" ("predicate context computation")
- option does and what "-prc off" doesn't do. Consider the following
- example:
-
- +------------------------------------------------------+
- | Note: All examples in this sub-section are based on |
- | code generated with -k=1 and -ck=1. |
- +------------------------------------------------------+
-
- expr : upper
- | lower
- | number
- ;
-
- upper : <<isU(LATEXT(1))>>? ID ;
- lower : <<isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc on" ("-prc off" is the default) the code for expr() to predict
- upper() would resemble:
-
- if (LA(1)==ID && isU(LATEXT(1)) && LA(1)==ID) { /* a1 */
- upper(zzSTR); /* a2 */
- } /* a3 */
- else { /* a4 */
- if (LA(1)==ID && isL(LATEXT(1)) && LA(1)==ID) { /* a5 */
- lower(zzSTR); /* a6 */
- } /* a7 */
- else { /* a8 */
- if (LA(1)==NUMBER) { /* a9 */
- zzmatch(NUMBER); /* a10 */
- } /* a11 */
- else /* a12 */
- {zzFAIL();goto fail;} /* a13 */
- } /* a14 */
- } ...
- ...
-
- *******************************************************
- *** ***
- *** Starting with version 1.20: ***
- *** Predicate tests appear AFTER lookahead tests ***
- *** ***
- *******************************************************
-
- Note that each test of LATEXT(i) is guarded by a test of the token type
- (e.g. "LA(1)==ID && isU(LATEXT(1))").
- Page 36
-
- With "-prc off" the code would resemble:
-
- if (isU(LATEXT(1)) && LA(1)==ID) { /* b1 */
- upper(zzSTR); /* b2 */
- } /* b3 */
- else { /* b4 */
- if (isL(LATEXT(1)) && LA(1)==ID) { /* b5 */
- lower(zzSTR); /* b6 */
- } /* b7 */
- else { /* b8 */
- if ( (LA(1)==NUMBER) ) { /* b9 */
- zzmatch(NUMBER); /* b10 */
- } /* b11 */
- else /* b12 */
- {zzFAIL();goto fail;} /* b13 */
- } /* b14 */
- } ...
- ...
-
- Thus when coding the grammar for use with "-prc off" it is necessary
- to do something like:
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
-
- This makes sure that a token of type NUMBER is not passed to isU()
- or isL() when using "-prc off".
-
- So, you say to yourself, "-prc on" is good and "-prc off" is bad. Wrong.
-
- Consider the following slightly more complicated example in which the
- first alternative of rule "expr" contains tokens of two different types:
-
- expr : ( upper | NUMBER ) NUMBER
- | lower
- | ID
- ;
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc off" the code would resemble:
-
- ...
- { /* c1 */
- if (LA(1)==ID && isU(LATEXT(1)) && /* c2 */
- ( LA(1)==ID || LA(1)==NUMBER) ) { /* c3 */
- { /* c4 */
- if (LA(1)==ID) { /* c5 */
- upper(zzSTR); /* c6 */
- } /* c7 */
- else { /* c8 */
- if (LA(1)==NUMBER) { /* c9 */
- zzmatch(NUMBER); /* c10 */
- } /* c11 */
- else {zzFAIL();goto fail;}/* c12 */
- } /* c13 */
- } ...
- ...
- Page 37
-
- Note that if the token is a NUMBER (i.e. LA(1)==NUMBER) then the clause at
- line c2 ("LA(1)==ID && ...") will always be false, which implies that the
- test in the "if" statement (lines c2/c3) will always be false. (In other
- words LA(1)==NUMBER implies LA(1)!=ID). Thus the sub-rule for NUMBER at
- line c9 can never be reached.
-
- With "-prc on" essentially the same code is generated, although it
- is not necessary to manually code a test for token type ID preceding
- the call to "isU()".
-
- The workaround is to bypass the heart of the predicate when
- testing the wrong type of token.
-
- upper : <<LA(1)==ID ? isU(LATEXT(1)) : 1>>? ID ;
- lower : <<LA(1)==ID ? isL(LATEXT(1)) : 1>>? ID ;
-
- Then with "-prc off" the code would resemble:
- ...
- { /* d1 */
- if ( (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* d2 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* d3 */
- ...
- ...
-
- With this correction the body of the "if" statement is now reachable
- even if the token type is NUMBER - the "if" statement does what one
- wants.
-
- With "-prc on" the code would resemble:
-
- ... /* e1 */
- if (LA(1)==ID && /* e2 */
- (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* e3 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* e4 */
- ...
- ...
-
- Note that the problem of the unreachable "if" statement body has
- reappeared because of the redundant test ("e2") added by the predicate
- computation.
-
- The lesson seems to be: when a rule has alternatives that are "visible"
- to ANTLR (within the lookahead distance) and that begin with different
- token types, it is probably dangerous to use "-prc on".
-
- Page 38
-
- 101. You cannot use downward inheritance to pass parameters
- to semantic predicates which are NOT validation predicates. The
- problem appears when the semantic predicate is hoisted into a
- parent rule to predict which rule to call:
-
- For instance:
-
- a : b1 [flag]
- | b2
- | b3
-
- b1 [int flag]
- : <<LA(1)==ID && flag && hasPropertyABC (LATEXT(1))>>? ID ;
-
- b2
- : <<LA(1)==ID && hasPropertyXYZ (LATEXT(1))>>? ID ;
-
- b3 : ID ;
-
- When the semantic predicate is evaluated within rule "a" to determine
- whether to call b1, b2, or b3 the compiler will discover that there
- is no variable named "flag" for procedure "a()". If you are unlucky
- enough to have a variable named "flag" in a() then you will have a
- VERY difficult-to-find bug.
-
- The -prc option has no effect on this behavior.
-
- It is possible that an init-action will inhibit the hoisting of the
- predicate and make this code work. I have not verified this with
- versions 1.2x.
-
- 102. Another reason why semantic predicates must not have side effects is
- that when they are hoisted into a parent rule in order to decide which
- rule to call they will be invoked twice: once as part of the prediction
- and a second time as part of the validation of the rule.
-
- Consider the example above of upper and lower. When the input does
- in fact match "upper" the routine isU() will be called twice: once inside
- expr() to help predict which rule to call, and a second time in upper() to
- validate the prediction. If the second test fails the macro zzpred_fail()
- is called.
-
- As far as I can tell, there is no simple way to disable the use of a
- semantic predicate for validation after it has been used for prediction.
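-
- The double evaluation is easy to demonstrate with a toy model in C
- (the names are invented; this is not generated code):

```c
/* Call counter makes the double evaluation visible. */
static int isU_calls = 0;

static int isU(void)
{
    isU_calls++;
    return 1;          /* pretend the text really is upper case */
}

/* Toy model of expr()/upper(): the hoisted copy of the predicate
 * runs during prediction, then the original runs again inside the
 * chosen rule as validation. */
static int match_upper(void)
{
    if (isU()) {       /* prediction, hoisted into expr() */
        return isU();  /* validation, inside upper() */
    }
    return 0;
}
```

- A predicate that, say, incremented a counter or entered a symbol
- into a table would therefore do so twice per successful match.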
-
- Page 39
-
- 103. I had a problem in which I needed to do a limited amount of
- lookahead, but didn't want to use all the machinery of syntactic
- predicates. I found that I could enlarge the set of expressions accepted
- by "expr" and then look at the AST created in order to determine what
- rules could follow:
-
- cast_expr /* a1 */
- : <<int isCast=0;>> /* a2 */
- /* a3 */
- LP! predefined_type RP! cast_expr /* a4 */
- <<#0=#(toAST(T_cast),#0);>> /* a5 */
- | LP! expr0 RP! /* a6 */
- <<if ((#2->token)==T_class_name) { /* a7 */
- isCast=1; /* a8 */
- } else { /* a9 */
- isCast=0; /* a10 */
- }; /* a11 */
- >> /* a12 */
- ( <<;>> <<isCast==1>>? /* a13 */
- <<printf ("\nIs cast expr\n");>> /* a14 */
- cast_expr /* a15 */
- <<#0=#(toAST(T_cast),#0);>> /* a16 */
- /* a17 */
- | <<printf ("\nIs NOT cast expr\n");>> /* a18 */
- () /* empty */ /* a19 */
- ) /* a20 */
- | unary_expr /* a21 */
-
- Later on I gave up on this approach and decided to use syntactic
- predicates anyway. It not only solved this problem, but others
- where it was more difficult to patch up the grammar. I can't bring
- myself to remove the example, though.
- Page 40
-
- ===============================================================================
- Section on Syntactic Predicates (also known as "Guess Mode")
- -------------------------------------------------------------------------------
- 104. The terms "infinite lookahead", "guess mode", and "syntactic
- predicates" are all equivalent. Sometimes the term "backtracking" is
- used as well, although "backtracking" can also be used to discuss
- lexing and DLG. The term "syntactic predicate" emphasizes that it is
- handled by the parser. The term "guess mode" emphasizes that the parser
- may have to backtrack. The term "infinite lookahead" emphasizes the
- implementation in ANTLR: the entire input is read, processed, and
- tokenized by DLG before ANTLR begins parsing.
-
- 105. An expression in a syntactic predicate should not have side-effects.
- If there is no match then the rule using the syntactic predicate won't be
- executed.
-
- 106. When using syntactic predicates the entire input buffer is read
- and tokenized by DLG before parsing by ANTLR begins. If a "wrong"
- guess requires that parsing be rewound to an earlier point, all
- attributes that were created during the "guess" are destroyed, and
- the parsing begins again, creating new attributes as it reparses the
- (previously) tokenized input.
-
- 107. In infinite lookahead mode the line and column information is
- hopelessly out-of-sync because zzline will contain the line number of
- the last line of input - the entire input was scanned before
- parsing began. The line and column information is not restored
- during backtracking. To keep track of the line information in a meaningful
- way one has to use the ZZINF_LINE macro which was added to pccts in version
- 1.20.
-
- Putting line and column information in a field of the attribute will not
- help. The attributes are created by ANTLR, not DLG, and when ANTLR
- backtracks it destroys any attributes that were created in making the
- incorrect guess.
-
- 108. As infinite lookahead mode causes the entire input to be scanned
- by DLG before ANTLR begins parsing, one cannot depend on feedback from
- the parser to the lexer to handle things like providing special token codes
- for items which are in a symbol table (the "lex hack" for typedefs
- in the C language). Instead one MUST use semantic predicates which allow
- for such decisions to be made by the parser.
-
- 109. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
-
- 110. An example of the need for syntactic predicates is the case where
- relational expressions involving "<" and ">" are enclosed in angle bracket
- pairs.
-
- Relation: a < b
- Array Index: b <i>
- Problem: a < b<i>
- vs. b < a>
-
- I was going to make this into an extended example, but I haven't had
- time yet.
-
- 111. Version 1.20 fixes a problem in 1.10 in which ASTs were constructed
- during guess mode. In version 1.10 care had to be taken to deallocate the
- ASTs that were created in the rules which were invoked in guess mode.
-
- Page 41
-
- 112. The following is an example of the use of syntactic predicates.
-
- program : ( s SEMI )* ;
-
- s : ( ID EQUALS )? ID EQUALS e
- | e
- ;
-
- e : t ( PLUS t | MINUS t )* ;
-
- t : f ( TIMES f | DIV f )* ;
-
- f : Num
- | ID
- | "\(" e "\)"
- ;
-
- When compiled with k=1:
-
- antlr -fe err.c -fh stdpccts.h -fl parser.dlg -ft tokens.h \
- -fm mode.h -k 1 test.g
-
- One gets the following warning:
-
- warning: alts 1 and 2 of the rule itself ambiguous upon { ID }
-
- even though the manual suggests that this is okay. The only problem is
- that ANTLR 1.10 should NOT issue this error message unless the -w2 option
- is selected.
-
- Included with permission of S. Salters
- Page 42
-
- ===============================================================================
- Section on Inheritance
- -------------------------------------------------------------------------------
- 113. A rule which uses upward inheritance:
-
- rule > [int result] : x | y | z;
-
- is simply declaring a function which returns an "int" as a function
- value. If the function has more than one item passed via upward
- inheritance then ANTLR creates a structure to hold the result and
- then copies each component of the structure to the upward inheritance
- variables.
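-
- The multiple-result case can be sketched in plain C (the struct and
- field names are invented; ANTLR's generated names differ):

```c
/* Analogue of: rule > [int count, int sum] : ... ; */
typedef struct {
    int count;    /* first upward-inheritance result  */
    int sum;      /* second upward-inheritance result */
} rule_rv;

static rule_rv rule(void)
{
    rule_rv rv;
    rv.count = 3;   /* stands in for actions like <<$count = ...;>> */
    rv.sum   = 10;
    return rv;      /* the caller copies each field out */
}
```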
-
- 114. When writing a rule that uses downward inheritance:
-
- rule [int *x] : r1 r2 r3
-
- one should remember that the arguments passed via downward inheritance are
- simply arguments to a function. If one is using downward inheritance
- syntax to pass results back to the caller (really upward inheritance!)
- then it is necessary to pass the address of the variable which will receive
- the result.
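-
- The caller side can be sketched in plain C (the names are invented):

```c
/* Analogue of: rule [int *x] : r1 r2 r3 ;  The "inherited"
 * parameter is just an ordinary function argument. */
static void rule(int *x)
{
    *x = 42;    /* stands in for an action like <<*x = value;>> */
}

/* The caller passes an address to receive the result, matching a
 * grammar invocation such as rule[&result]. */
static int caller(void)
{
    int result = 0;
    rule(&result);
    return result;
}
```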
-
- 115. ANTLR is smart enough to combine the declaration for an AST with
- the items declared via downward inheritance when constructing the
- prototype for a function which uses both ASTs and downward inheritance.
- Page 43
-
- ===============================================================================
- Section on LA, LATEXT, NLA, and NLATEXT
- -------------------------------------------------------------------------------
- 116. Need examples of LATEXT for various forms of lookahead.
-
- ===============================================================================
- Section on Prototypes
- -------------------------------------------------------------------------------
- 117. Prototype for typical create_attr routine:
-
- #define zzcr_attr(attr,token,text) \
- create_attr(attr,token,text)
-
- void create_attr (Attrib *attr,int token,char *text);
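-
- A typical body for such a routine might look like this (the Attrib
- layout is invented for the sketch; a real project defines its own):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute layout; real projects define their own,
 * usually in the #header section before the generated files. */
typedef struct {
    int   token;
    char *text;
} Attrib;

/* Record the token code and duplicate the scanned text, since the
 * DLG buffer it points into is overwritten as later tokens are
 * scanned. */
static void create_attr(Attrib *attr, int token, char *text)
{
    attr->token = token;
    attr->text  = malloc(strlen(text) + 1);
    if (attr->text != NULL)
        strcpy(attr->text, text);
}
```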
-
- 118. Prototype for a typical create_ast routine invoked to automatically
- construct an AST from an attribute:
-
-          #define zzcr_ast(ast,attr,tok,text) \
-                create_ast(ast,attr,tok,text)
-
-          void create_ast (AST *ast,Attrib *attr,int tok,char *text);
-
- 119. Prototype for a typical make_ast routine invoked by the #[...]
- notation.
-
- AST *zzmk_ast (AST *ast,int token,char *text)
-
- 120. Prototype for a typical zzd_ast macro which is invoked when destroying
- an AST node:
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST * node);
-
- 121. Prototype for zzdef0 macro to initialize $0 of a rule:
-
- #define zzdef0(attr) define_attr_0 (attr)
-
- void define_attr_0 (Attrib *attr);
-
- 122. Prototype for ANTLR (these are actually macros):
-
- read from file: void ANTLR (void startRule(...),FILE *)
- read from string: void ANTLRs (void startRule(...),zzchar_t *)
- read from function: void ANTLRf (void startRule(...),int (*)())
- read from file: void ANTLRm
- (void startRule(...),FILE *,int lexclass)
-
- In the call to ANTLRf the function behaves like getchar()
- in that it returns EOF (-1) to indicate end-of-file.
-
- If ASTs are used or there is downward or upward inheritance then the
- call to the startRule must pass these arguments:
-
- AST *root;
-          ANTLR (startRule(&root),stdin);
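-
- A getchar()-style input function suitable for ANTLRf can be sketched
- as follows (the fixed input string is just for illustration):

```c
#include <stdio.h>   /* for EOF */

/* Input held in a fixed string for the sketch; a real callback
 * might read from a socket, a pipe, or an editor buffer. */
static const char *input_text = "a = b;";
static int input_pos = 0;

/* Behaves like getchar(): returns successive characters of the
 * input, then EOF (-1) once the input is exhausted. */
static int next_char(void)
{
    if (input_text[input_pos] == '\0')
        return EOF;
    return (unsigned char)input_text[input_pos++];
}
```

- It would then be installed with a call of the form
- ANTLRf (startRule(), next_char); adding the AST or inheritance
- arguments to the start rule as described above when needed.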
- Page 44
-
- ===============================================================================
- Section on ANTLR/DLG Internals Routines That Might Be Useful
- -------------------------------------------------------------------------------
- ****************************
- ****************************
- ** **
- ** Use at your own risk **
- ** **
- ****************************
- ****************************
-
- 123. static int zzauto - defined in dlgauto.h
-
- Current DLG mode. This is used by zzmode() only.
-
- 124. void zzerr (char * s) defined in dlgauto.h
-
- Defaults to zzerrstd(char *s) in dlgauto.h
-
- Unless replaced by a user-written error reporting routine:
-
- fprintf(stderr,
- "%s near line %d (text was '%s')\n",
- ((s == NULL) ? "Lexical error" : s),
- zzline,zzlextext);
-
- 125. static char zzebuf[70] defined in dlgauto.h
- Page 45
-
- ===============================================================================
- Section on Known Minor Bugs in pccts
- -------------------------------------------------------------------------------
- 126. In the 8-Aug-94 distribution of version 1.21, line 434 of err.h
- should be changed:
-
- from: zzline = zzinf_line[zzinf_line];
- to: zzline = zzinf_line[zzinf_labase];
-
- Diagnosed (and fix) reported by Anders Sterholm (anvil@lysator.liu.se)
-
- 127. With version 1.20 (1-Apr-94) of ANTLR the -w2 switch will sometimes
- cause items to be reported as lacking an associated regular expression when
- they actually do have an associated regular expression. Sometimes
- #lexclass names and #tokclass names will also be reported. This is
- reported as fixed in 1.21.
-
- 128. The UPDATE.120 of (1-Apr-94) reports that there are problems in
- combining guess mode and semantic predicates under some circumstances.
-
- 129. In version 1.20 there is a bug in the 1-Apr-94 distribution of
- routine _zzsetmatch() which appears when used with ANTLR -gk (demand
- lookahead). This is reported as fixed in version 1.21.
-
- 130. In version 1.20 a rule which returns more than two arguments
- (via upward inheritance) generates code which cannot be compiled by
- a strict K&R compiler. It is accepted by ANSI compilers, however.
- This is reported as fixed in version 1.21. Reported by Joachim Schrod
- (schrod@iti.informatik.th-darmstadt.de).
-
- 131. There appears to be a bug in 1.20 and 1.21 when using syntactic
- predicates. Sometimes attributes are corrupted by backtracking when k=1
- under a combination of circumstances which are difficult to describe. A
- patch for this problem is reported in netnews note #421 and the problem
- will be fixed in release 1.22. This problem was reported by Sanjay
- Ghemawat (Sanjay@lcs.mit.edu)
-
- Page 46
-
- 132. The following problem was first reported by Carle Patrice
- (carle@litp.ibp.fr) in netnews note #347. TJP describes the problem
- in netnews note #428:
-
- It is related to lookahead computation for k=3 with a hoisted
- predicate. I have reduced the error to this case:
-
- a : (A B | B C) | b ;
-
- b : <<pred-c>>? B B ;
-
- It did not compute the correct lookahead decision for
- alternative one in rule "a". It should generate:
-
- if ( (LA(1)==A || LA(1)==B) && (LA(2)==B || LA(2)==C) &&
- !((LA(1)==B&&(LA(2)==B))) ) {
- ...
- }
- else blah
-
- *NOT* simply:
-
- if ( (LA(1)==A || LA(1)==B) && (LA(2)==B || LA(2)==C) ) {
- ...
- }
- else blah
-
- This will be fixed in version 1.22.
-
- Page 47
-
- 133. In version 1.10 there was a bug in the hoisting of semantic
- predicates when all alternatives of a rule had the same lookahead token.
- I believe this has been fixed in versions >= 1.20. Consider the following
- (compiled with -prc off -k 1 -ck 3):
-
- ;obj_name
- : global_func_id
- | simple_class SUFFIX_DOT
- | ID
-
- ;simple_class
- : <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
-
- ;global_func_id
- : <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>? ID
-
- In version 1.2x the generated code resembles:
-
- void obj_name(void) {
- if ( (LA(1)==ID) &&
- (LA(1)==ID ? isFunction(LATEXT(1)) : 1)) ) {
- global_func_id();
- } else {
- if ( (LA(1)==ID) &&
- (LA(2)==SUFFIX_DOT) ) {
- simple_class();
- } else {
- if ( (LA(1)==ID) ) {
- zzmatch(ID);
- ...
-
- The workaround, previously, was to precede semantic predicates with an
- action (such as "<<;>>") which inhibits hoisting into the prediction
- expression.
- Page 48
-
- #header
- <<
- #include "charptr.h"
- >>
-
- #token QUESTION
- #token COMMA
- #token ID
-
- #token Eof "@"
-
- #token SUFFIX_DOT
-
- statement
- : (information_request)?
- | obj_name
-
- ;information_request
- : QUESTION id_list ( QUESTION )*
- | id_list ( QUESTION )+
-
- ;id_list
- : obj_name ( COMMA | obj_name )*
-
- ;simple_class
- : <<(LA(1)==ID ? isClass(LATEXT(1)) : 1)>>? ID
-
- ;global_func_id
- : <<(LA(1)==ID ? isFunction(LATEXT(1)) : 1)>>? ID
-
- ;obj_name
- : <<;>> global_func_id
- | <<;>> simple_class SUFFIX_DOT
- | ID
- ;
- Page 49
-
- ===============================================================================
- Ideas on the Construction of ASTs and their use with Sorcerer
- -------------------------------------------------------------------------------
- Consider the problem of a grammar which would normally require two
- passes through the source code to properly analyze. In some cases
- it is convenient to perform a first pass which creates AST trees
- and perform the second pass by analyzing the AST trees with Sorcerer.
-
- 1) Define an AST node that contains the information you'll need in the
- second pass. For example,
-
- /*
- * Parse trees are represented by an abstract-syntax-tree (AST)
- * (forward declare the pointer here). Refer to parse.h for description
- * of parse_info.
- */
- typedef struct parse_struct *ast_ref;
-
- /* parser attributes ($-symbols) & AST nodes */
- typedef struct parse_struct *pinfo_ref;
-
- /*
- * the parse structure is used to describe both attributes and
- * AST nodes
- */
-
- struct parse_struct {
- pinfo_ref right; /* points to siblings */
- pinfo_ref down; /* points to children */
- int token; /* token number (see tokens.h) */
- char *text; /* input text */
- src_pos pos; /* position in source file */
- object_ref obj; /* object description (id's) */
- type_ref typ; /* type description (expr's) */
- const_value value; /* value of a constant expression */
- } ;
-
- /*
- * define Abstract Syntax Tree (AST) nodes
- */
-
- /* ast_ref was forward-defined */
- typedef struct parse_struct AST;
-
- /*
- * the Pass-1 (parse phase) parse-attributes ($-variables)
- * have the same structure as an AST node.
- */
- typedef struct parse_struct Attrib, *Attrib_ref;
-
-
- In the code above, the parse-attribute was defined to have the same
- structure as an AST node. This isn't a requirement, but just makes it
- easier to pass information produced in the first pass on to subsequent
- passes.
- Page 50
-
- 2) Have the first pass build a symbol table as it parses the input, perform
- semantic checks, and build an AST. Use the -gt (generate tree) option on
- ANTLR, and override the automatically generated tree construction operations
- as needed. For example,
-
- var_declare:
- << pvec_ref v_list;
- int i;
- boolean has_var_section = FALSE;
- >>
- VAR^
- (
- var_id_list > [v_list] COLON
- { extern_kw
- | static_kw
- }
- type
- <<
- for (i = 0; i < v_list->len; ++i) {
- object_ref v = (object_ref) v_list->val[i];
- define_var(v, $4.typ);
- }
- >>
- { ASSIGNMENT expr
- << mark_var_use(#2, VAR_RHS); >>
- }
- SEMI
- << free_pvec(v_list); >>
- )+
- ;
- var_id_list > [pvec_ref v_list]:
- << object_ref this_var;
- $v_list = new_pvec();
- >>
- ID
- << this_var = new_var_id(&$1);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- (
- COMMA ID
- << this_var = new_var_id(&$2);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- )*
- ;
-
- The "pvec" stuff above is just a vector of pointers that can be
- extended automatically. A linked list would work just as well. The
- idea is that we must first collect the declared variables, then
- parse the type declaration, then apply bind the type to the declared
- variables. We used ANTLR's auto-tree-generation mode, and didn't
- override its actions with our own. Therefore, the following Sorcerer
- fragment will recognize the AST built for a variable declaration:
- Page 51
-
- var_declare:
- #( VAR
- ( v_list: var_id_list COLON
- { EXTERN | STATIC }
- type
- { ASSIGNMENT expr }
- SEMI
- )+
- )
- ;
- var_id_list:
- ID ( COMMA ID)*
- ;
-
- Here's an example, where we use explicit rules to build an AST:
-
- expr!:
- simple_expr
- << $expr = $1; #0 = #1; >>
- ( rel_op simple_expr
- << parse_binary_op(&$expr, &$1, &$2); #0 = #(NULL, #0, #1, #2); >>
- )*
- << $expr.token = EXPR;
- $expr.text = "expr";
- #0 = #(#[&$expr], #0);
- >>
- ;
-
- The construct, #[&$expr] first takes the address of the $expr
- attribute (attributes are structures, not pointers, in this example),
- and then applies the #[] operation which makes a call to the routine
- that creates an AST node, given an attribute (or attribute address
- in our case). It takes a while to get the hang of where the &'s,
- #'s, and $'s go, but can be a real time-saver once you master it.
- What we're doing above is building a special EXPR (expression) node.
- This node would be parsed as follows in subsequent passes, using
- Sorcerer:
-
- expr: #( e: EXPR
- l_oprnd: simple_expr << e->typ = l_oprnd->typ; >>
- (op: rel_op r_oprnd: simple_expr
- <<
- e->typ = std_bool_type->obj_type;
- if (op->token == IN) {
- /* no type conversion checking for IN.
- * try to rewrite simple IN ops.
- */
- if (is_simple_in_op(l_oprnd, op, r_oprnd)) {
- rewrite_simple_in_op(l_oprnd, op, r_oprnd);
- }
- } else {
- cvt_term(&l_oprnd, op, r_oprnd, _t);
- }
- >>
- )*
- );
- Page 52
-
- We left in the actual actions of the second (Sorcerer driven) pass.
- Notice how the Sorcerer grammar labels various parts of the expr
- node ('e', 'l_oprnd', 'op', and 'r_oprnd'). This gives the second
- pass access to each node, as it is recognized.
-
- The second pass uses the 'typ' field, which contains the type of
- the ID, expression, or literal parsed by the first pass. In the
- actions above, we are propagating additional type information (for
- example, the result of a relational op is always a boolean, checking
- for implicit type conversions, and handling simple cases of Pascal's
- IN operation). The fragment above is from a Pascal to Ada translator,
- so the translator has to make Pascal's implicit type conversions
- between integer and real into explicit Ada type conversions, and
- has to convert operations on sets (i.e. IN) into operations on
- packed boolean arrays, in Ada, or calls to runtime routines.
-
- Sometimes when you are building the AST for a given construct,
- you need to use information gained from semantic analysis. An
- example is the "assignment" or "call" statement:
-
- /*
- * If a variable access appears alone, then it must be either a call to
- * a procedure with no parameters, or an indirection through a pointer
- * to a procedure with no parameters.
- */
- assign_or_call_stmt!:
- << type_ref r_type = NULL;
- ast_ref v;
- >>
- variable
- << v = #1;
- $assign_or_call_stmt = $1;
- $assign_or_call_stmt.token = PROC_CALL;
- $assign_or_call_stmt.text = "proc_call";
- r_type = $assign_or_call_stmt.typ;
- if (v != NULL && v->obj != NULL
- && v->obj->obj_result != NULL
- && v->obj->obj_kind == func_obj
- && v->down->token == ID
- && v->down->right == NULL) {
- object_ref func = v->obj;
- /* function name used on left hand side;
- * convert to reference to the function's return value
- */
- v->obj = func->obj_result;
- v->typ = func->obj_result->obj_type;
- v->down->text = func->obj_result->obj_name;
- v->down->obj = func->obj_result;
- v->down->typ = func->obj_result->obj_type;
- }
- #0 = v;
- >>
- Page 53
-
- { ASSIGNMENT expr
- << $assign_or_call_stmt.token = ASSIGNMENT;
- $assign_or_call_stmt.text = ":=";
- mark_var_use(#2, VAR_RHS);
- mark_var_use(v, VAR_LHS);
- #0 = #(NULL, #0, #[&$1], #2);
- >>
- | ( LPAREN actual_param_list RPAREN
- << mark_actual_param_use(#2, r_type);
- mark_var_use(v, VAR_RHS);
- #0 = #(NULL, #0, #[&$1], #2, #[&$3]);
- >>
- )
- }
- << #0 = #( #[&$assign_or_call_stmt], #0); >>
- ;
-
-
- The problem we're solving is that both an assignment statement and
- a procedure call statement begin with a "variable". Since ANTLR
- is LL-based, this statement construct is "ambiguous" in that
- both statement types (assignment and call) begin with the same
- non-terminal. A "variable" includes such operations as array
- subscripting, pointer dereferencing, and record field selection.
- Thus, a "variable" may comprise an arbitrary number of tokens.
-
- We might use syntactic predicates as a form of look-ahead to resolve
- the two cases above, but instead I decided to make the assumption that
- we have "PROC_CALL", and to correct that "guess" once we see the
- assignment operation. Thus, the above rule will build one of the
- following two AST structures:
-
- assign_stmt:
- #( ASSIGNMENT
- variable ASSIGNMENT expr
- )
- ;
- call_stmt:
- #( PROC_CALL
- variable {LPAREN actual_param_list RPAREN}
- )
- ;
-
- In your AST, you might want to drop unnecessary syntactic tokens such as
- ASSIGNMENT, LPAREN, RPAREN, COMMA, COLON, etc. We kept them, because we
- thought it would be necessary for certain parts of source-to-source
- translation. We don't think that's true, any more, but have not gone back
- and changed the AST structure either.
- Page 54
-
- 3) Build a separate Sorcerer grammar file to recognize the AST that you
- have built, and then add your second pass actions. These actions will
- access fields in the AST node, that were filled in by the first pass. For
- example, identifiers will probably have an "object_ref" that points to the
- object named by the identifier, and expression (EXPR) nodes will have a
- "typ" field that gives the expression's type. You might also add a "value"
- field that gives the value of a literal, named literal, or statically
- evaluated constant expression. See the code fragments above for some ideas
- on how this is done.
-
- Conclusions:
-
- 1) You'll need an ANTLR (.g) description for pass1, and a separate
- Sorcerer (.sor) description for pass2. Often the pass2 AST
- representation is much more regular and well-formed than the
- original text token stream used by pass1.
-
- 2) It can be a bit intimidating putting the pieces together.
- Try it incrementally, trying a small subset of your larger
- problem.
-
- 3) There are a lot of ways to go with how you represent
- attributes ($-variables), AST nodes, and the things that
- go on in various passes. For example, you might have
- pass1 simply build the AST and perform *no* symbol definitions
- or semantic checks. Then pass2 might walk the tree and build
- the symbol, and make various checks. Pass2 might also disambiguate
- cases that look syntactically similar, and can only be disambiguated
- using symbol definitions. Then, you could have a pass3 (another
- Sorcerer driven tree-walk) that does the 'real work' of your
- compiler/translator.
-
- Contributed by Gary Funck (gary@intrepid.com)
- Page 55
-
- ===============================================================================
- Example 1 of #lexclass
- ===============================================================================
- Borrowed code
- -------------------------------------------------------------------------------
- /*
- * Various tokens
- */
- #token "[\t\ ]+" << zzskip(); >> /* Ignore whitespace */
- #token "\n" << zzline++; zzskip(); >> /* Count lines */
-
- #token "\"" << zzmode(STRINGS); zzmore(); >>
- #token "'" << zzmode(CHARACTERS); zzmore(); >>
- #token "/\*" << zzmode(COMMENT); zzskip(); >>
- #token "//" << zzmode(CPPCOMMENT); zzskip(); >>
-
- /*
- * C++ String literal handling
- */
- #lexclass STRINGS
- #token STRING "\"" << zzmode(START); >>
- #token "\\\"" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\"\n\r\\]+" << zzmore(); >>
-
- /*
- * C++ Character literal handling
- */
- #lexclass CHARACTERS
- #token CHARACTER "'" << zzmode(START); >>
- #token "\\'" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\'\n\r\\]" << zzmore(); >>
-
- /*
- * C-style comment handling
- */
- #lexclass COMMENT
- #token "\*/" << zzmode(START); zzskip(); >>
- #token "\n" << zzline++; zzskip(); >>
- #token "~[\*\n]+" << zzskip(); >>
- #token "\*" << zzskip(); >>
-
- /*
- * C++-style comment handling
- */
- #lexclass CPPCOMMENT
- #token "[\n\r]" << zzline++; zzmode(START); zzskip(); >>
- #token "~[\n\r]" << zzskip(); >>
-
- #lexclass START
-
- /*
- * Assorted literals
- */
- #token OCT_NUM "0[0-7]*"
- #token L_OCT_NUM "0[0-7]*[Ll]"
- #token INT_NUM "[1-9][0-9]*"
- #token L_INT_NUM "[1-9][0-9]*[Ll]"
- #token HEX_NUM "0[Xx][0-9A-Fa-f]+"
- #token L_HEX_NUM "0[Xx][0-9A-Fa-f]+[Ll]"
- #token FLOAT_NUM "([1-9][0-9]*{.[0-9]*} | {0}.[0-9]+) {[Ee]{[\+\-]}[0-9]+}"
-
- /*
- * Identifiers
- */
- #token Identifier "[_a-zA-Z][_a-zA-Z0-9]*"
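- The strtol() conversions in the escape-handling actions above can be
- checked in isolation. decode_escape() below is an illustrative stand-in
- for those actions, not part of pccts: 'text' plays the role of zzbegexpr
- and points at the leading backslash, so the conversion must start one
- character past it (strtol() itself accepts the "0x" prefix in base 16):

```c
#include <assert.h>
#include <stdlib.h>

/* Given the matched text of an escape such as "\65", "\0101" or "\0x41",
 * return the character it denotes.  'text' points at the backslash, as
 * zzbegexpr does in the lexer actions, so strtol() starts at text+1. */
static char decode_escape(const char *text, int base)
{
    return (char) strtol(text + 1, NULL, base);
}
```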
-
- ===============================================================================
- Example 2: ASTs
- ===============================================================================
- #header <<
-
- #include "charbuf.h"
- #include <string.h>
-
- int nextSerial;
-
- #define AST_FIELDS int token; int serial; char *text;
- #include "ast.h"
-
- #define zzcr_ast(ast,attr,tok,astText) \
- (ast)->token=tok; \
- (ast)->text=strdup( (char *) &( ( (attr)->text ) ) ); \
- nextSerial++; \
- (ast)->serial=nextSerial;
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST *node);
-
- >>
-
- <<
-
- AST *root=NULL;
-
- void show(AST *tree) {
- if (tree->token==ID) {
- printf (" %s <#%d> ",
- tree->text,tree->serial);}
- else {
- printf (" %s <#%d> ",
- zztokens[tree->token],
- tree->serial);
- };
- }
- void before (AST *tree) {
- printf ("(");
- }
- void after (AST *tree) {
- printf (")");
- }
-
-
- void delete_ast(AST *node) {
- printf ("\nzzd_ast called for <node #%d>\n",node->serial);
- free (node->text);
- return;
- }
-
- int main() {
- nextSerial=0;
- ANTLR (expr(&root),stdin);
- printf ("\n");
- zzpre_ast(root,show,before,after);
- printf ("\n");
- zzfree_ast(root);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]+"
- #token NEWLINE "\n"
- #token OpenAngle "<"
- #token CloseAngle ">"
-
- expr : (expr0 NEWLINE)
-
- ;expr0 : expr1 {"="^ expr0}
- ;expr1 : expr2 ("\+"^ expr2)*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- -------------------------------------------------------------------------------
- Sample output from this program:
-
- a=b=c=d
- ( = <#2> a <#1> ( = <#4> b <#3> ( = <#6> c <#5> d <#7> ))) NEWLINE <#8>
- zzd_ast called for <node #7>
- zzd_ast called for <node #5>
- zzd_ast called for <node #6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #8>
- zzd_ast called for <node #2>
-
- a+b*c
- ( \+ <#2> a <#1> ( \* <#4> b <#3> c <#5> )) NEWLINE <#6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #6>
- zzd_ast called for <node #2>
-
- a*b+c
- ( \+ <#4> ( \* <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #1>
- zzd_ast called for <node #5>
- zzd_ast called for <node #2>
- zzd_ast called for <node #6>
- zzd_ast called for <node #4>
-
- ===============================================================================
- Example 3: Syntactic Predicates
- ===============================================================================
- Not completed.
- ===============================================================================
- Example 4: DLG input function
- ===============================================================================
- This example demonstrates the use of a DLG input function to work
- around a limitation of DLG. In this example the user wants to
- recognize an exclamation mark as the first character of a line and
- treat it differently from an exclamation mark elsewhere. The
- workaround is for the input function to return a non-printing
- character (binary 1) when it finds an "!" in column 1. If it reads a
- genuine binary 1 in column 1 of the input text it returns a "?".
-
- The parse is started by:
-
- int DLGchar (void);
- ...
- ANTLRf (expr(&root),DLGchar);
- ...
- -------------------------------------------------------------------------------
- #token BANG "!"
- #token BANG_COL1 "\01"
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]+"
- #token NEWLINE "\n"
-
- expr! : (bang <<printf ("\nThe ! is NOT in column 1\n");>>
- | bang1 <<printf ("\nThe ! is in column 1\n");>>
- | id <<printf ("\nFirst token is an ID\n");>>
- )* "@"
-
- ;bang! : BANG ID NEWLINE
-
- ;bang1! : BANG_COL1 ID NEWLINE
-
- ;id! : ID NEWLINE
- ;
- -------------------------------------------------------------------------------
-
- #include <stdio.h>
-
- /*
- Antlr DLG input function - See page 18 of pccts 1.00 manual
- */
-
- static int firstTime=1;
-
- static int c;
-
- int DLGchar (void) {
- if (feof(stdin)) {
- return EOF;
- }
- if (firstTime || c=='\n') {
- firstTime=0;
- c=fgetc(stdin);
- if (c==EOF) return (EOF);
- if (c=='!') return ('\001');
- if (c=='\001') return ('?');
- return (c);
- } else {
- c=fgetc(stdin);
- return (c);
- }
- }
-
- ===============================================================================
- Example 5: Maintaining a Stack of DLG Modes
- ===============================================================================
- Contributed by David Seidel
- Normally this should be appended to err.h
- -------------------------------------------------------------------------------
- #define MAX_MODE ????
- #define ZZMAXSTK (MAX_MODE * 2)
-
- static int zzmstk[ZZMAXSTK] = { -1 };
- static int zzmdep = 0;
-
- void
- #ifdef __STDC__
- zzmpush( int m )
- #else
- zzmpush( m )
- int m;
- #endif
- {
- if(zzmdep == ZZMAXSTK - 1)
- { zzerr("Mode stack overflow");
- }
- else
- { zzmstk[zzmdep++] = zzauto;
- zzmode(m);
- }
- }
-
- void
- zzmpop()
- {
- if(zzmdep == 0)
- { zzerr("Mode stack underflow");
- }
- else
- { zzmdep--;
- zzmode(zzmstk[zzmdep]);
- }
- }
-
- -------------------------------------------------------------------------------
- A modified version of the above routine which allows the user to pass a
- "lookahead" routine to be called when the mode stack is popped. Normally
- this will be appended to err.h.
- -------------------------------------------------------------------------------
- #define ZZMAXSTK ????
-
- static int zzmstk[ZZMAXSTK] = { -1 }; /* stack of DLG modes */
- static void (*zzfuncstk[ZZMAXSTK])(); /* stack of pointer to functions */
- static int zzmdep = 0;
-
- void pushMode( int m ,void (*func)())
- {
- if(zzmdep == ZZMAXSTK - 1)
- { zzerr("Mode stack overflow");
- }
- else
- { zzmstk[zzmdep] = zzauto;
- zzfuncstk[zzmdep] = func;
- zzmdep++;
- zzmode(m);
- }
- }
-
- void popMode()
- {
- void (*thisFunc)();
- if(zzmdep == 0)
- { zzerr("Mode stack underflow");
- }
- else
- { zzmdep--;
- thisFunc=zzfuncstk[zzmdep];
- zzmode(zzmstk[zzmdep]);
- zzmstk[zzmdep]=0;
- zzfuncstk[zzmdep]=0;
- /* this call might result in indirect recursion of popMode() */
- if (thisFunc!=0) {
- (*thisFunc)();
- };
- }
- }
-
- void resetModeStack() {
- zzmdep=0;
- zzmstk[0]=0;
- zzfuncstk[0]=0;
- }
-
- /* if the lookahead character is a semi-colon then keep on popping */
-
- void popOnSC() {
- if (zzchar==';') popMode();
- }
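- The push/pop-with-callback pattern above can be exercised outside of
- DLG with a self-contained sketch. All names below are illustrative:
- in the real routines, zzauto, zzmode(), and zzchar come from DLG,
- whereas here a plain 'current' variable stands in for the active mode:

```c
#include <assert.h>
#include <stddef.h>

#define MAXSTK 8

/* 'current' stands in for DLG's zzauto; assigning it, for zzmode() */
static int   current = 0;
static int   modestk[MAXSTK];
static void (*funcstk[MAXSTK])(void);
static int   depth = 0;

static void push_mode(int m, void (*func)(void))
{
    if (depth < MAXSTK - 1) {
        modestk[depth] = current;   /* remember the mode we are leaving */
        funcstk[depth] = func;      /* remember its "lookahead" routine */
        depth++;
        current = m;
    }
}

static void pop_mode(void)
{
    if (depth > 0) {
        depth--;
        current = modestk[depth];
        if (funcstk[depth] != NULL)
            (*funcstk[depth])();    /* may itself call pop_mode() again */
    }
}
```

- Popping restores both the saved mode and, if one was supplied, invokes
- the callback, which is how popOnSC() can cascade further pops.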
-
- ===============================================================================
- Example 6: Debug code for create_ast, mk_ast, delete_ast to locate lost ASTs
- ===============================================================================
- This is an example of code which tries to keep track of lost ASTs using
- a doubly linked list of all ASTs, maintained by calls from create_ast()
- and mk_ast() to zzastnew_userhook(). When ASTs are deleted, calls to
- zzastdelete_userhook() from the user's AST delete routines remove them
- from the doubly linked list. Any ASTs left on the list after zzfree_ast()
- must be considered lost.
-
- This method does not monitor ASTs created by zzdup_ast(), because
- zzdup_ast() does not call the create_ast() or mk_ast() routines.
- -------------------------------------------------------------------------------
- The #header section must include a definition of AST_FIELDS with the
- equivalent of:
-
- struct _ast *flink, *blink; int serialNumber;
- -------------------------------------------------------------------------------
- int main() {
- ...
- again:
- ...
- reset_ASTlistHead(); /* <======================== */
- ANTLR (sourcecode(&root),stdin);
- treewalk(root);
- zzfree_ast(root);
- root=NULL;
- print_lost_ast(); /* <======================= */
- printf ("\n");
- ...
- goto again;
- ...
- }
- -------------------------------------------------------------------------------
- #ifndef H_ZZNEWAST_USERHOOK
- #define H_ZZNEWAST_USERHOOK
-
- void reset_ASTlistHead(void);
- void zzunhook_tree(AST *tree);
- void zzastnew_userhook(AST *newNode);
- void zzastdelete_userhook (AST *singleNode);
- void print_lost_ast (void);
- void treewalk(AST *tree);
-
- #endif
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include <stdlib.h>
- #include "zzastnew_userhook.h"
-
- static AST ASTlistHead;
- static int ASTserialNumber;
-
- void reset_ASTlistHead(void) {
- while (ASTlistHead.flink!=0 && ASTlistHead.flink!= &ASTlistHead) {
- zzfree_ast(ASTlistHead.flink);
- };
- ASTlistHead.flink=&ASTlistHead;
- ASTlistHead.blink=&ASTlistHead;
- ASTserialNumber=1;
- return;
- }
-
- /* Stop tracking ASTs in a tree without actually deleting them */
-
- void zzunhook_tree (AST * tree) {
-
- while (tree != 0) {
- zzunhook_tree (tree->down);
- zzastdelete_userhook (tree);
- tree=tree->right;
- };
- return;
- }
-
- /* Track new AST */
-
- void zzastnew_userhook(AST *newNode) {
-
- AST *prev;
-
- prev=ASTlistHead.blink;
- prev->flink=newNode;
- ASTlistHead.blink=newNode;
- newNode->blink=prev;
- newNode->flink=&ASTlistHead;
- newNode->serialNumber=ASTserialNumber;
- ASTserialNumber++;
- return;
- }
-
- /* Stop tracking an AST */
-
- void zzastdelete_userhook (AST *singleNode) {
-
- AST *fnode;
- AST *bnode;
-
- if (singleNode!=0) {
- fnode=singleNode->flink;
- bnode=singleNode->blink;
- fnode->blink=bnode;
- bnode->flink=fnode;
- singleNode->serialNumber=0;
- singleNode->flink=0;
- singleNode->blink=0;
- };
- return;
- }
-
- /* Print ASTs that are still on list */
-
- void print_lost_ast () {
-
- AST *node;
-
- for (node=ASTlistHead.flink;
- node!=0 && node!= &ASTlistHead;
- node=node->flink) {
- printf ("**** Start of lost AST listing **** %d\n",node->serialNumber);
- treewalk (node); /* user supplied routine */
- printf ("\n**** End of lost AST listing ****\n");
- };
- }
-
- -------------------------------------------------------------------------------
- These routines print out the AST tree. This will be application dependent.
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include <stdlib.h>
-
- static int treenest=0;
-
- void treeindent(int nesting) {
- int i;
- for (i=0;i<nesting*2;i++) {
- printf (" ");
- };
- return;
- }
-
- void treewalk1 (AST *tree) {
- while (tree != NULL) {
- treeindent(treenest);
- printf ("%s",zztokens[tree->token]);
- if (tree->text != NULL) {
- printf (" %s",tree->text);
- };
- printf ("\n");
- treenest++;
- treewalk1 (tree->down);
- treenest--;
- tree=tree->right;
- };
- return;
- }
-
- void treewalk (AST *tree) {
- treenest=0;
- treewalk1(tree);
- return;
- }
- -------------------------------------------------------------------------------
-